1. Preliminaries
1.1 Introduction 

A study of intermittent/transient faults (I/T faults)
in digital systems requires a confrontation with a multitude
of issues. It soon becomes clear, moreover, that in the
period of one year, not all the issues can be considered.

We have, therefore, taken one approach - of many possible
approaches - to this area, and have reached a point which we
feel indicates reasonable progress. In the following sections
we will detail the state of our investigation, and we will try
to indicate the strong and weak points of our current position
with this problem area. This section will serve as an overview
of these results.

The ultimate goal of this study is to perform survivability
evaluation of digital systems for I/T faults. The framework
within which this evaluation is to be performed is generically
described by the CARE II approach. However, survivability has
heretofore been addressed primarily from the point of view of
long-term or uniform survivability. The explicit consideration
of I/T faults requires instead the consideration of interval
survivability. Interval survivability is a measure of the
probability of the system surviving a fixed time interval of
I/T fault activity in the sense that at the end of this interval
the system can continue to operate (perhaps at the cost of some
recovery operations) in an acceptable (but perhaps degraded)
mode.

We have not, at this time, implemented a means of evaluating
such a survivability number. It is clear, however, that
the task of doing so is at least as complicated as that of
writing the actual CARE II program. Moreover, to do so is
beyond the scope of the work reported here, and would, in fact,
overlap significantly with the development of CARE III. However,
crucial to any such evaluation is the capability to
detect and diagnose I/T faults. (This is sometimes referred
to as the "D" and "I" of the DIR function.) We have here,
then, the motivation for much of the work to be reported in
the following: the chief results to be reported will consist
of the development of methodology for detecting and diagnosing
I/T faults in digital systems. We will show that there
are specific bounds and guidelines to this detection and diagnosis
for I/T faults in general which must be taken into
account in any interval survivability evaluation. These bounds
and guidelines are detailed such that they can be incorporated
into any testing and diagnosing methodology.

In addition we report on the status of an experimental
attempt to determine the effect of various physical I/T faults
(for example, those which might be determined to be important
from the 1977 Lear jet Experiments) on a type of module which
could be a part of a general computer system characterized by
the CARE II model. The goal here is to functionally determine
the effect of various I/T faults so that the methodologies for
testing and diagnosing I/T faults can be fully exploited by
taking into account the detailed test set requirements.

All of this, then, leads to further refinements in
interval survivability evaluation for digital systems in
the presence of I/T faults.

2. Interval Survivability

Our objective is to enhance and evaluate the survivability
of a computer system by use of fault tolerant techniques.
Survivability is evaluated as the probability that the system
will survive until a given time, $t$. The concept of survivability
is extended by the inclusion of degraded modes of operation.
For example, a system may survive to time $t_1$ in the
undegraded mode but survive from time $t_1$ to time $t_2$ in a
degraded mode.

Various methods of increasing the fault tolerance of a
computer system have been proposed in the literature. The
method considered here is the use of standby sparing. The
computer system is assumed to be partitioned into several
stages. Within each stage are a number of identical units.
A certain number of the units in a given stage are in active
operation while several other units serve as standby units.
In the event that an active unit fails, a spare unit is
tested and then switched in to replace the bad active unit.

The principles of detection and location of faults and
of recovery of the program in progress are essential to a
gracefully degrading, standby sparing computing system. A
system of both hardware and software detectors is provided
to detect the presence of any faults and then to locate the
fault to within a certain category. A strategy for recovering
from faults in each category must be provided. When the presence
of a fault is detected, further propagation of errors
is prevented by stopping the program in progress and holding
information needed for recovery. After the fault is isolated
to within a certain category and a spare unit switched in to
replace the faulty unit (no spare unit is switched in for the
cases of transient faults), a recovery strategy is then
carried out to restore the program in progress. A failure to
detect or isolate a fault or a failure to restore the program
in progress will result in a system failure.

In order to design for fault tolerance, a great deal
must be known about the nature of the faults that can occur.
A transient fault is generated external to the affected units;
therefore one must be able to identify transient faults so
that recovery may be done without replacing any units with a
spare unit. An intermittent fault is a failure that is generated
within a unit. If a fault is identified as intermittent,
it is best to treat the fault as a permanent fault, i.e.,
replace the unit with a spare unit to avoid the consequences
of recurrence of the intermittency. In order to detect faults
as quickly as possible, one must have a good model of the probability
of occurrence of each type of fault that may occur. Using
an accurate fault model, one then designs hardware detectors
and software procedures to detect the various faults in an
efficient manner. For permanent faults, accurate failure rates
are given by the manufacturer. However, additional study is
required to model intermittent faults, which are generally
caused by loose wire bonds in integrated circuits, and
transient faults, which are generally the result of electromagnetic
disturbances generated in a thunderstorm by nearby
lightning strikes.

The CARE II program is designed to evaluate the survivability
of the gracefully degrading and standby sparing
computing system described above. The program derives the
probability that the system can survive until time $t$.
Information required for CARE II includes failure rates of
the units and rate of occurrence of nonrecoverable transient
faults. The CARE II program includes a coverage model
which demands a detailed knowledge of what faults may occur,
the characteristics of the hardware and software detectors,
and probability of recovery given the class of fault and
the time elapsed from the occurrence of the fault to its
detection and to its isolation. Coverage is the part of the
fault tolerance design that includes detection, isolation,
and recovery from a fault. Once the fault is detected by a
given detector, the isolation and recovery procedures follow
according to a deterministic path. However, several detectors
may be capable of detecting the same fault and which one
actually succeeds will then determine the isolation and
recovery procedures. Due to the nondeterministic nature of
fault detection, CARE II requires categorization of faults
into fault classes. Fault classes are chosen such that the
detection of a fault class by the set of detectors is statistically
independent with respect to the detectors. To derive
coverage coefficients, i.e., the probability of recovering
from a fault given that a certain number of spare units must
be tested before a good one is found, one must also know the
probability of occurrence of the various faults. Finally,
one must also know the effectiveness of the recovery strategy
for a given fault and given time delays included in detection
and isolation. Specifically one must provide $r(\tau, \tau')$, where
$r(\tau, \tau')$ is the probability, for a given type of fault, that the
program will recover after a delay of $\tau$ seconds from fault
occurrence to fault detection and a delay of $\tau'$ seconds from
fault detection to fault isolation. CARE II assumes that

$$r(\tau, \tau') = r'(\tau) + r''(\tau + \tau')$$

where $r'(\tau)$ accounts for propagation of errors before the
fault is detected and $r''(\tau + \tau')$ accounts simply for total
time lost in the running of the program.

The present literature on gracefully degrading and
standby sparing computing systems assumes a uniform failure
rate. The problem of lightning induced faults brings up
the question of interval survivability, that is, design and
evaluation of the computing system with respect to toleration
of a severe transient fault inducing environment over a
limited period of time. It will be necessary to modify the
CARE II model to account for a change in detection and recovery
procedures for the severe transient interval. The emphasis
of the detection procedure must be shifted to detection of
the transient faults likely to occur in a thunderstorm.
Such a shift may easily be initiated by the detection of
electromagnetic disturbances by an external detector provided
for such a purpose. The external detector may provide data
on the nature of the transient environment so that the optimum
fault tolerant strategy may be employed. For example, the
fault tolerant strategy should be a function of the severity
of each electromagnetic disturbance as it occurs and also the
expected length of the severe transient fault inducing interval.
If several operations are being carried out at once,
those operations that are most affected by the transient environment
should be postponed, if possible, or done in a way
that is less sensitive to transient faults. To study the modification
of gracefully degrading and standby sparing computer
systems for interval survivability, one requires an extensive
knowledge of the effects of a thunderstorm on the operation of
the computing system and a knowledge of the effects of lightning
induced transient faults on the various functions of the computing
system.

3. Current Status of Module Self-Diagnosis Theory
3.1 Introduction

In this section we will give an overview of the theory of
self-diagnosis of digital systems. It should be kept in mind that
while our concern is with I/T faults, the majority of this reviewed
work deals with permanent faults. However, this work must nevertheless
be appreciated as it represents an initial condition on the results
of Section 4. Moreover, from Section 2 we have seen that for
survivability evaluation, self-diagnosis is a crucial factor.

In particular, we will consider the design of such systems
which will operate in a distributed processing/decentralized control
mode. In order for such systems to operate in a fault tolerant
environment, it is imperative to incorporate into the design some
degree of self-diagnosability.

In this section, a system is considered which can be partitioned
into $n$ functional units, or modules, each possessing some degree of
intelligence. A major assumption to be made is that each module be
completely capable of testing the correctness of other specified
modules in the system. Such tests cannot be well defined in the
general sense. (Indeed, their generation is the responsibility of
the module designer.) They may be considered to be strictly hardware,
as in the case of a dedicated hardware monitor. Similarly, a module
may utilize various diagnostic software routines to generate and
compare test patterns that will be applied to the module being tested.
In the most general sense, however, a test may be thought of as any
combination of hardware and software which enables a unit to reach
a conclusion as to the operational state of another module.
A unique test will probably need to be designed for each module,
based on the fact that a single universal test would not provide
adequate fault coverage for each module, due to their functional
differences. It should be noted that the implementation of such
tests is not a trivial matter and poses a major obstacle to the
design of totally self-diagnosable systems.

The actual diagnosis problem is one of detection and location
of all faulty modules that may be present in the system. Once all
of the tests in the testing interconnection design have been completed,
it is the function of the entire system to detect the
presence of any single or multiple faults. A system diagnosis
algorithm must be implemented to examine the set of test outcomes
and determine if any faults have occurred. A factor which adds to
the complexity of the problem is the possibility of incorrect
test outcomes produced by modules which are themselves faulty.

A basic assumption affecting the testing interconnection of
devices is one of generating an upper bound on the allowable number
of modules which may become defective at any one time. Depending
upon this figure, the testing scheme may be very simple or quite
complex. It will be shown that the number of modules in the system
and their testing interconnections have a direct dependence upon
this upper bound.

The goals of attaining a totally self-diagnosable system are
twofold. The first is in enabling the system to operate in a
fault tolerant environment. Upon the detection of any module
failures, the system could conceivably isolate those devices,
reconfigure and then recover to resume its processes, all without
any external communications. Such performance would be essential
in situations where it would be disastrous for the entire system
to crash. The second goal of attaining self-diagnosability would
be to ease system maintenance. The detection of any faults in the
system could signal an error condition. This in turn could effect
a physical replacement of the faulty components, if possible, to
update the system to resume correct operation. Thus, in any situation,
self-diagnosable systems are easily seen to be a major
consideration in increasing system reliability.


3.2 SYSTEM MODELS AND THEIR ANALYSIS

Various system models have been proposed to aid in the analysis
of diagnosable systems. The basic goal of each of the models is to
demonstrate the testing interconnections employed and to deduce the
performance characteristics of such schemes. These characteristics
may include an upper bound on the number of faults that may occur
in the system while the system still possesses the ability to locate
just a single fault or perhaps diagnose the entire fault situation.

Probably the most well known model was proposed by Preparata [3.7]
a decade ago. In this model the system is decomposed into $n$ different
subsystems, or modules, each with the capacity to test the correctness
of the others. The model itself is a graph-theoretic one in
which each of the $n$ nodes represents one of the $n$ modules and a
directed edge is included to denote a testing interconnection between
two modules. Each of the testing links is represented by $b_{ij}$, in
which unit $U_i$ evaluates unit $U_j$. The weight associated with each
$b_{ij}$ is $a_{ij}$. The test outcome $a_{ij}$ is a 0 if module $U_i$ is
fault free and tests $U_j$ to be also fault free. However, if $U_i$ is
fault free and tests $U_j$ to be faulty, then $a_{ij} = 1$. In the situation
where the testing module $U_i$ is faulty, its test output could
possibly be a 0 or 1, regardless of the actual condition of the
module $U_j$ being tested. The testing connection of the system can be
represented by a connection matrix $C = [c_{ij}]$, where

$$c_{ij} = \begin{cases} 1 & \text{if } b_{ij} \text{ exists,} \\ 0 & \text{if } b_{ij} \text{ does not exist.} \end{cases}$$

Once all of the tests in the model have been completed, each $a_{ij}$
has been assigned a corresponding binary weight. It is from this
set of test outcomes, i.e. the system syndrome, that the system
diagnostics will be performed.
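
The outcome rules of this model can be sketched in a few lines of
Python. This is only an illustration, not part of the original study:
the function names, the 0-based unit indices, and the use of a seeded
random choice to stand in for the unreliable outcome of a faulty tester
are all assumptions made for the example.

```python
import random

def connection_matrix(n, links):
    """Build C = [c_ij], with c_ij = 1 when testing link b_ij exists."""
    c = [[0] * n for _ in range(n)]
    for i, j in links:
        c[i][j] = 1
    return c

def syndrome(links, faulty, rng=None):
    """Return {(i, j): a_ij} for every testing link b_ij.

    A fault-free tester reports the true state of the tested unit;
    a faulty tester may report 0 or 1 (modelled here by a random pick).
    """
    rng = rng or random.Random(0)
    outcomes = {}
    for i, j in links:
        if i in faulty:
            outcomes[(i, j)] = rng.choice([0, 1])
        else:
            outcomes[(i, j)] = 1 if j in faulty else 0
    return outcomes

# Hypothetical example: five units in a single loop, unit 2 faulty.
links = [(i, (i + 1) % 5) for i in range(5)]
s = syndrome(links, faulty={2})
print(s)
```

Only the outcome on the link leaving the faulty unit is unpredictable;
every other entry of the syndrome is fixed by the fault set.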

The system diagnostics may be oriented towards one of two
possible approaches. The first is often referred to as one-step
t-fault diagnosability. It is the goal of this method to identify
(locate) all of the faulty units in the system, given its syndrome.
A necessary constraint is that the number of faulty units does not
exceed the upper bound $t$. Another approach is in viewing the system
as being sequentially t-fault diagnosable. Here, it is guaranteed
that at least one faulty unit can be detected directly from analysis
of the system syndrome. Again, it is assumed that the maximum
number of faulty modules does not exceed $t$. It is obvious that a
one-step t-fault diagnosable system is also sequentially t-fault
diagnosable. The motivation behind each of these situations should
also be clear. In the one-step case, the syndrome is examined to
identify each of the faulty units. These units may all be replaced
at once, thus enabling the system to once again become fully operational.
On the other hand, in the sequentially t-fault diagnosable
situation, the syndrome is examined such that only one faulty unit
is detected. This faulty unit is then replaced and a new syndrome
is generated. This procedure would be repeated up to $t$ times,
until all of the faulty units had been replaced. While one-step
t-fault diagnosability is more efficient than sequential t-fault
diagnosability, the complexity involved may not warrant the increased
performance.
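
The difference between the two approaches can be illustrated by a
brute-force search over fault sets. The sketch below is not the
diagnostic algorithm referred to elsewhere in this report; it simply
enumerates every fault set of at most $t$ units that is consistent with
a syndrome. The function names, the example interconnection, and the
choice that faulty testers report 0 are assumptions for illustration.

```python
from itertools import combinations

def consistent(syndrome, faulty):
    """True if the fault set `faulty` could have produced `syndrome`.

    Only outcomes reported by fault-free testers are constrained.
    """
    return all(a == (1 if j in faulty else 0)
               for (i, j), a in syndrome.items() if i not in faulty)

def candidates(n, syndrome, t):
    """All fault sets of at most t units consistent with the syndrome."""
    return [set(c) for k in range(t + 1)
            for c in combinations(range(n), k)
            if consistent(syndrome, set(c))]

# Hypothetical example: 5 units, unit i tests units i+1 and i+2 (mod 5).
links = [(i, (i + m) % 5) for i in range(5) for m in (1, 2)]
true_faults = {0, 3}
# Faulty testers are assumed here to report 0 for every test they perform.
syn = {(i, j): (0 if i in true_faults else int(j in true_faults))
       for i, j in links}

found = candidates(5, syn, t=2)
print(found)                     # one-step: a unique candidate names all faults
print(set.intersection(*found))  # sequential: units faulty in every candidate
```

When the candidate list has a single entry, the whole fault set is
located in one step. When several candidates remain, any unit common to
all of them can still be replaced, which is the sequential view.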

In his paper, Preparata made a few basic, but very important
observations in relation to diagnosable systems. The first was in
generating a lower bound on the number of units to ensure the
system to be diagnosable. That is, given that a system is one-step
t-fault diagnosable, then $n \ge 2t + 1$. Conversely, if a system of
$n$ modules is said to be one-step t-fault diagnosable, then an upper
bound on its degree of diagnosability is $t \le \lfloor (n - 1)/2 \rfloor$.
It should be emphasised that these bounds may not be reached if an
inefficient connecting scheme is employed. Another important
observation made was in bounding the smallest number of units needed
to test another to ensure the system to be one-step t-fault
diagnosable. It was found that it was necessary that each unit be
tested by at least $t$ other units. Thus, we see that for a system of
$n$ units, the minimum number of connections that enable one-step
t-fault diagnosability is $N = nt$ links.

An interconnection design in which $n = 2t + 1$ and each unit is
tested by exactly $t$ other units in such a way as to be one-step
t-fault diagnosable is said to be optimal. A well known class of
optimal designs is the so-called $D_{\delta t}$ design. In this design, a
testing link from $U_i$ to $U_j$ exists if and only if $j - i = \delta m$
(modulo $n$) and $m$ assumes the values $1, 2, \ldots, t$. Examples,
including $D_{22}$ with $n = 5$, are shown in Figure 3.1. It was shown
that the design is an optimal one whenever $\delta$ and $n$ are relatively
prime, and as such, would allow the system designer to employ a most
efficient testing interconnection scheme in the overall design. Also,
this type of testing interconnection between devices provides the
ability to synthesize an efficient diagnostic algorithm, such as the
one proposed by Meyer and Masson [3.5]. (This algorithm is presented
in detail in the Appendix.)

As has already been pointed out, the complexity of a one-step
t-fault diagnosable interconnection scheme leads to a rather large
number of testing links. It is for this reason that sequential
t-fault diagnosable systems have been studied. Since we are
utilizing the same model as before, the lower bound on the number
of units, $n \ge 2t + 1$, is still valid. However, Preparata showed that
there exists a class of designs with the number of testing links
$N = n + 2t - 2$ such that the resulting system is sequentially
t-fault diagnosable. Essentially, this design was that of a simple
loop along with a subset of $2t - 2$ units all testing a common unit.
Such a testing interconnection scheme is shown in Figure 3.2 with
$n = 14$ and $t = 6$.

The simplest sequential t-fault analysis is through the use
of a single loop system. With this interconnection, a lower bound
on the number of units to guarantee sequential t-fault diagnosability
is given by the following:

$$n \ge 1 + (m + 1)^2 + \lambda (m + 1)$$

with $t = 2m + \lambda$, $m$ integral and $\lambda = 0, 1$.

A table comparing one-step t-fault diagnosis to sequential t-fault
diagnosis with respect to their lower bounds on the number of units
and testing links is given in Figure 3.3. Upon examination of the
table, it is obvious that as the allowable number of faults increases,
sequential t-fault diagnosis becomes much more cost effective in
relation to the hardware needed to realize the design.
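
The bounds quoted in this section can be evaluated side by side with a
short Python sketch. It is not a reproduction of Figure 3.3; the
function names are assumptions, and evaluating $N = n + 2t - 2$ at
$n = 2t + 1$ in the loop below is done only to put the formulas next to
each other, not because the source states that the sequential class
attains that value of $n$.

```python
def one_step_bounds(t):
    """Optimal one-step design: n = 2t + 1 units, N = n*t links."""
    n = 2 * t + 1
    return n, n * t

def sequential_links(n, t):
    """Links in the sequential class described above: N = n + 2t - 2."""
    return n + 2 * t - 2

def single_loop_units(t):
    """Smallest n for a single loop: 1 + (m+1)^2 + lam*(m+1), t = 2m + lam."""
    m, lam = divmod(t, 2)
    return 1 + (m + 1) ** 2 + lam * (m + 1)

for t in range(1, 7):
    n, links = one_step_bounds(t)
    loop_n = single_loop_units(t)
    # A single loop of loop_n units has exactly loop_n testing links.
    print(t, n, links, sequential_links(n, t), loop_n)
```

For the configuration of Figure 3.2, `sequential_links(14, 6)` gives 24
links, whereas testing each of 14 units by 6 others would take 84.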

As one of the first to address the problem, Preparata has
been shown to have made a few important initial contributions to the
study of diagnosable systems. Among them have been the concepts
of one-step and sequential t-fault diagnosis, along with each of
their respective lower bounds on the number of units and testing
interconnections. Also, a class of optimal designs was proposed.
What was noticeably missing, however, was the means to determine
the diagnosability number, i.e. the maximum number of allowable
faulty units, of a general interconnection scheme.

To this end, it was necessary to be able to fully characterize
the connection assignment of diagnosable systems. Hakimi and Amin
[3.4] essentially picked up the study where Preparata left off, in
that they claimed to have shown both the necessary and sufficient
conditions for a system to be t-diagnosable. In their terminology,
a system that is t-diagnosable is directly analogous to
Preparata's one-step t-fault diagnosis. The model used is exactly
the same as the one that had previously been considered. That is,
the system in question can be viewed as a directed graph explicitly
showing the testing connections between modules.

One of the main results of Hakimi and Amin was the determination
of necessary and sufficient conditions for a system to be t-diagnosable
when the system has the property that no two units test each
other. Very simply, they found that the system is t-diagnosable if
and only if each unit is tested by at least $t$ other units. As an
example, it is seen that the $D_{\delta t}$ design, which is known to be
t-diagnosable, satisfies the above conditions.


What was obviously needed, however, was a general characterization
of any system, regardless of the interconnection scheme. The
following approach was proposed. Consider a dipath from $U_{i_1}$ to
$U_{i_k}$, in which a sequence of vertices and edges in $G$ (the graph),

$$U_{i_1}, (U_{i_1}, U_{i_2}), U_{i_2}, \ldots, (U_{i_{k-1}}, U_{i_k}), U_{i_k},$$

exists, where $(U_{i_1}, U_{i_2})$ denotes a directed edge from $U_{i_1}$
to $U_{i_2}$. When there is a dipath from $U_i$ to $U_j$ in $G$, then $U_j$
is said to be reachable from $U_i$. $G$ is defined as being strongly
connected if any pair of vertices are mutually reachable. Hakimi and
Amin claimed the following: If $n \ge 2t + 1$ and $K(G) \ge t$, then the
system is t-diagnosable, where the connectivity $K(G)$ of a digraph $G$
is the minimum number of vertices whose removal from $G$ yields a
graph that is not strongly connected. It is claimed that whenever
these conditions are met by any system, then the system
diagnosability number can be verified.
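
For small systems the connectivity condition can be checked by
exhaustive search. The Python sketch below is an illustration only:
the function names are assumptions, the search is exponential in $n$,
and the convention of returning $n - 1$ when no removal destroys strong
connectivity is a choice made for the example.

```python
from itertools import combinations

def reachable(vertices, edges, src):
    """Vertices reachable from src by dipaths that stay inside `vertices`."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b in vertices and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def strongly_connected(vertices, edges):
    """True if every pair of the given vertices is mutually reachable."""
    vertices = set(vertices)
    return all(reachable(vertices, edges, v) == vertices for v in vertices)

def connectivity(n, edges):
    """K(G): fewest vertices whose removal leaves a graph that is not
    strongly connected (n - 1 if no such removal exists)."""
    for k in range(n - 1):
        for removed in combinations(range(n), k):
            if not strongly_connected(set(range(n)) - set(removed), edges):
                return k
    return n - 1

def meets_claimed_condition(n, edges, t):
    """The claimed sufficient condition: n >= 2t + 1 and K(G) >= t."""
    return n >= 2 * t + 1 and connectivity(n, edges) >= t

# Hypothetical examples: unit i tests i+1 and i+2 (mod 5), and a 3-unit chain.
design = [(i, (i + m) % 5) for i in range(5) for m in (1, 2)]
chain = [(0, 1), (1, 2)]
print(connectivity(5, design), meets_claimed_condition(5, design, 2))
print(connectivity(3, chain))
```

A graph that is not strongly connected to begin with has $K(G) = 0$,
so the condition can only certify it for $t = 0$.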

An example which questions these results is shown in Figure 3.4.
By inspection, it is obvious that the graph as shown is not strongly
connected: for at least one pair of units, a dipath from one to the
other does not exist. Therefore, the connectivity of this system
is $K(G) = 0$, which implies that the system is 0-diagnosable. A
discrepancy arises here in that it is felt that the system is
instead 1-diagnosable, or equivalently, sequentially 1-fault
diagnosable. It will be shown in the next discussion of Russell and
Kime's model that the above system is indeed sequentially 1-fault
diagnosable. Thus we see that given a general interconnection scheme,
necessary and sufficient conditions have been proposed by Hakimi and
Amin to determine the degree of the system's diagnosability, although
at the present they do seem to be questionable.

Another diagnostic model has been proposed by Russell and Kime
which tends to be somewhat more general than the ones previously
studied. In this model a system can be represented by either the
G array approach $S = (\mathcal{F}, T, F, G)$ or by the diagnostic graph
approach $S = (\mathcal{F}, T, F, G)$. Common to both is the set
$\mathcal{F}$, which is the set of $n$ possible faults that may occur in
the system. Thus, $\mathcal{F} = \{f_1, f_2, \ldots, f_n\}$. Now consider
all of the $2^n$ possible subsets of $\mathcal{F}$, where $F^k$
($k = 1, 2, \ldots, 2^n$) is one of the allowable fault patterns that
may occur in the system. The entire class of fault patterns can be
represented by a $2^n \times n$ F array as follows:

$$F_{ki} = 1 \iff f_i \in F^k .$$

The set $T = \{t_1, t_2, \ldots, t_p\}$ represents the $p$ pass-fail tests
that can be applied to $S$. A test $t_j$ is a complete test for a fault
$f_i$ if and only if a) $t_j$ always fails for $f_i$ alone and b) $t_j$
always passes for no faults. Thus, we see that the set of tests which
are complete for a fault pattern $F^k$ is given as
$t(F^k) = t(f_i) \cup t(f_j) \cup \cdots \cup t(f_l)$, where
$f_i, f_j, \ldots, f_l \in F^k$. An invalid test set $T(F^k)$ is defined
as the set of tests that may not correctly specify the nature of the
system in the presence of a fault pattern $F^k$. That is, the test
outcomes may become unreliable in the case where a) $t_j$ might pass
if $t_j \in t(F^k)$ and b) $t_j$ might fail if $t_j \notin t(F^k)$. This
leads us to the notion of a valid test set. In the presence of a fault
pattern $F^k$, a valid test is one in which a) $t_j$ always fails if
$t_j \in t(F^k)$ and b) $t_j$ always passes if $t_j \notin t(F^k)$. It can
be shown that the valid test set for a fault pattern can be derived
from the invalid test set: the valid tests are those tests in $T$
that are not in $T(F^k)$.

Up to this point we have completely described those parts of
the model which are common to both the diagnostic graph and G array
approaches. In the G array, or generalized fault table, the set of
test outcomes, or syndromes, is represented by a $2^n \times p$ matrix
with the following structure:

$$G_{kj} = \begin{cases}
0 & \text{if } t_j \notin t(F^k) \text{ and } t_j \notin T(F^k) \text{ (i.e. always passes),} \\
1 & \text{if } t_j \in t(F^k) \text{ and } t_j \notin T(F^k) \text{ (i.e. always fails),} \\
X & \text{if } t_j \in T(F^k) \text{ (i.e. don't know).}
\end{cases}$$

To aid in visualizing the system, a diagnostic graph can be
drawn. This graph is essentially a digraph in which each allowable
single fault in the system is represented by a vertex of the graph.
A directed edge from $f_i$ to $f_j$ represents the situation in which
$f_i$ being faulty invalidates a test that is complete for $f_j$. Thus,
we see that according to this model, the set of complete tests for a
fault is just the set of all incoming edges to it. In the presence of
a fault pattern $F^k$, the set of invalid tests is the set of all
outgoing tests of the faults in $F^k$. An example of a diagnostic
graph of a given system is shown in Figure 3.5. Notice that this model
differs significantly from the one previously discussed in that more
than one unit may be needed to perform a test. Therefore, it is evident
that this type of model allows a finer partitioning of the system. For
example, both Data Channel (C1) and Memory (M1) act in conjunction
with each other in testing the ROM Control (R1).

This model as proposed is much more efficient than the one
introduced by Preparata due to the fact that it considers much
more general systems. In a normal design, a single fault may invalidate
a test that would be performed. In such cases, the system could be
viewed to be morphic, that is, $T(F^i \cup F^j) = T(F^i) \cup T(F^j)$. The
present model, however, also enables special systems to be represented.
As an example, suppose there exists a design in which two
or more simultaneous faults are necessary in order to invalidate
a test. Such would be the situation in a triple modular redundant
design. These types of systems would be semi-morphic systems, in
which $T(F^i \cup F^j) \supseteq T(F^i) \cup T(F^j)$. Semi-morphic systems
could be modelled as above by making a morphic approximation. Thus,
we see that the diagnostic model covers a rather large class of system
designs.

In their first paper, Russell and Kime [3.8] define diagnosability
with repair to be exactly the same as sequential t-fault
diagnosis. In order to derive the conditions for diagnosability,
the concept of a closed fault pattern was studied. A closed fault
pattern $F^k$ is one in which every test for each fault in $F^k$ is
invalidated in the presence of $F^k$. Note that this is analogous to
the classical masking ideas of combinational logic circuits. The
system closure index $C(S)$ is the cardinality of the smallest closed
fault pattern in the system. It is with this index that a necessary
condition is given for the system to be t-fault diagnosable with
repair. It is stated as follows:

$$C(S) \ge 2t + 1 .$$

It was also shown that for $t = 1, 2, 3$ the above is both necessary
and sufficient. Now, recall that in Figure 3.4 the system was
deemed to be 0-fault diagnosable according to Hakimi and Amin.
Since 1-fault diagnosis is equivalent to 1-fault diagnosability
with repair, we can also use Russell and Kime's
results to determine the diagnosability of the hypothetical system.
Upon examination of Figure 3.4, it is seen that the smallest closed
fault pattern contains three faults; therefore $C(S) = 3$. Since
$C(S) = 3 \ge 2t + 1$, this implies that the system diagnosability
number is $t = 1$. Therefore, it appears we have a contradiction
between the two approaches, with Hakimi and Amin's results seeming
to be in question.
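
The closure index lends itself to a direct, if exponential, search.
The Python sketch below treats a pattern $F$ as closed when
$t(F) \subseteq T(F)$, with $T(F)$ taken as the union of the invalid
test sets of the individual faults. The function names, the exclusion
of the empty pattern, and the 3-unit loop used as input are assumptions
for illustration; the loop is not a transcription of Figure 3.4.

```python
from itertools import combinations

def is_closed(pattern, complete, invalid):
    """True if every test complete for a fault in the pattern is
    invalidated by the pattern, i.e. t(F) is a subset of T(F)."""
    t_F = set().union(*(complete[f] for f in pattern))
    T_F = set().union(*(invalid[f] for f in pattern))
    return t_F <= T_F

def closure_index(faults, complete, invalid):
    """C(S): size of the smallest non-empty closed fault pattern,
    or None if no pattern is closed."""
    for k in range(1, len(faults) + 1):
        for p in combinations(faults, k):
            if is_closed(p, complete, invalid):
                return k
    return None

# Hypothetical 3-unit loop: test j checks unit j and is run by unit j-1.
faults = [0, 1, 2]
complete = {f: {f} for f in faults}
invalid = {f: {(f + 1) % 3} for f in faults}

c = closure_index(faults, complete, invalid)
t_max = (c - 1) // 2   # largest t with C(S) >= 2t + 1
print(c, t_max)
```

For this loop only the pattern containing all three faults is closed,
so the necessary condition admits at most $t = 1$.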

In their second study, Russell and Kime [3.9] define diagnosability
without repair to be equivalent to one-step t-fault diagnosis, or
simply t-diagnosability. Fault masking was found to be essential in
determining the conditions for t-diagnosability. A fault pattern
$F^j$ is said to be masked by $F^k$ if every test for each fault in $F^j$
is invalidated in the presence of $F^k$. This concept is similar to
that of faults completely masking others in logic networks. The
masking index is the cardinality of the smallest fault pattern $F^k$
that masks $F^j$ and is denoted by $M(F^j)$. Thus, we see that $F^j$ is
masked by $F^k$ if and only if $t(F^j) \subseteq T(F^k)$. Also, if a fault
in $F^j$ is not masked by $F^k$, it is said to be exposed. The exposure
index of a fault pattern, $e(F^k)$, is the number of faults exposed in
it. The minimum of the exposure indices of the fault patterns
containing $k$ faults is termed the system exposure index of order
$k$, $e_k(S)$. From these definitions, we can find the number of faulty
elements in $F^k$ to be $|M(F^k)| + |e(F^k)|$.

Necessary and sufficient conditions for a system to be t-fault
diagnosable without repair are given as follows:

a) $M(S) \ge t$

b) $C(S) \ge 2t + 1$

c) $e_k(S) \ge 2t + 1 - k$ for $k = t+1, \ldots, \min(2t-1, n)$

Thus, with these results one can determine to what degree a system
is diagnosable without repair. A major obstacle in working with
these ideas is the amount of bookkeeping involved in determining
the system masking, closure and exposure indices.

A final model to consider has been published recently by
Barsi [SJl]. The diagnostic model proposed is a slight modification
of the one introduced by Preparata, with the notion of producing
a more realistic representation of the system. The basic assumptions
are the following:

1) each test is performed by a single unit;

2) each unit must be capable of testing any other unit;

3) no unit tests itself; and

4) for any pair $(u_i, u_j)$, unit $u_i$ performs at most one
test of unit $u_j$.

The actual difference between this model and that of Preparata is
with regard to the set of possible test outcomes. Assuming that a
testing link exists from $u_i$ to $u_j$, we have the following set of
allowable test outcomes:

0, if $u_i$ is fault-free and $u_j$ is fault-free
1, if $u_i$ is fault-free and $u_j$ is faulty
0 or 1, if $u_i$ is faulty and $u_j$ is fault-free
1, if both $u_i$ and $u_j$ are faulty.

Notice that when both $u_i$ and $u_j$ are faulty, the test outcome is
always a "1". The reasoning behind this is that some type of
self-checking design is assumed to be incorporated into the critical
parts of the testing devices. Thus, according to this approach, a "0"
test outcome guarantees that the tested unit is fault-free, since a
faulty tested unit always produces a "1".
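
Enumerating the four rows of the outcome table makes explicit which
unit states are compatible with each observed outcome. The following
is a minimal sketch; the function names are illustrative assumptions
and do not come from the cited model.

```python
def allowed_outcomes(tester_faulty, tested_faulty):
    """Allowable test outcomes for one testing link under the table above."""
    if tested_faulty:
        # A faulty tested unit yields 1 whether or not the tester is faulty.
        return {1}
    # Tested unit is fault-free: a fault-free tester reports 0,
    # a faulty tester may report either value.
    return {0, 1} if tester_faulty else {0}


def states_consistent_with(outcome):
    """All (tester_faulty, tested_faulty) pairs that can produce `outcome`."""
    return [
        (tester, tested)
        for tester in (False, True)
        for tested in (False, True)
        if outcome in allowed_outcomes(tester, tested)
    ]
```

Under this table every state compatible with a 0 outcome has a
fault-free tested unit, whereas a 1 outcome is also compatible with
a faulty tester evaluating a fault-free unit.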

Due to this slight difference in the model just discussed, the
upper bound on the diagnosability number t is seen to increase.
In fact, with a system of n units, the one-step diagnosability of
the system is $t = n - 2$. Observe that this largely exceeds
Preparata's bound of $t \le \lfloor (n-1)/2 \rfloor$. Necessary and
sufficient conditions were also derived for the system to be
one-step t-diagnosable. These are:
a) |b(x)^ - t ^Xi£N and b) for each pair (x.y) with 
X ^ N, ye N, (bCx)| = ^B(y)| = t and y6B(x)nD(x) there exists at 
least one node u such that either ueB(x) - B(y)f\ B{x) and 
B(u) ^ B(y), or u€B(y) - B(x)(\B(y) and B(u) 4 ^ B(x), where B (x) is 
the predecessor set of x and D(x) is the successor set of x.
Examination of these conditions reveals that only (a) bears on the
optimality of an interconnection design. In fact, equality in (a)
specifies that a design is indeed optimal. A technique for the
synthesis of an optimal design was proposed, although it is
essentially equivalent to the design previously discussed.


Barsi also addressed the problem of diagnosability without
repair. In doing so, the index f(x) is defined to be the minimum
number of faulty units that can give rise to a syndrome of all 1's
in the system, with the constraint that unit x is faulty. Similarly,
a second index is defined to be the minimum number of faulty units
that can give rise to a system syndrome of all 1's provided that
unit x is fault-free. Using these, it is claimed that a system
having a strongly connected graph is t-diagnosable with repair if 
and only if t -^max ~ Thus, we see that 

, .. . p • . 

there exists but one more way of determining the diagnosability of 
a given system- 

In conclusion, various models used in studying diagnosable
systems have been presented. It is felt that they adequately
represent the research that has been directed at the area, ranging
from the basic concepts to the more advanced analysis which
represents the current state of the art with respect to diagnosable
systems.

With this in mind, we can now proceed to our extensions of
this theory to I/T faults.







Figure 3.2. A sequential 6-fault diagnosis connection
Figure 3.3. A comparison between one-step and sequential diagnosis systems
Figure 3.4. An interconnection scheme where K(G) = 0

4. I/T Testing and Diagnosis

4.0 Introduction

Previous work on intermittent fault detection has been
done by Breuer [3-1] and Kamal and Page [3-2]. Breuer
assumed that the statistics of the intermittent fault can be
modeled by a two-state first-order Markov process. State FP
corresponds to the fault being present at time $t_q$ and state
FN corresponds to the fault not being present at $t_q$. The
transition probabilities for going from one state at $t_q$ to
either state FP or FN at $t_{q+1}$ are assumed known. From
this, the state probabilities associated with states
FP and FN at any time $t_q$ can be determined as a function
of these probabilities at some initial time $t_0$. Let $T$ be
a collection of tests $x_i$, $i = 1, 2, \ldots, K$, where each $x_i$
is a single test pattern for an intermittent fault in a
combinational circuit. $T$ will detect the presence of the
intermittent fault under test if the fault is present when at least
one of the $x_i$ is applied. The probability that $T$ will
detect the presence of the intermittent fault is a function of
$D_i$, the time between the application of tests, and $K$, the
number of tests applied.

The model used by Kamal and Page is a special case of the
above one. In this case, the transition probabilities between
the two states are assumed to be equal. Thus, the first-order
Markov process reduces to a zero-order process. It is assumed
that $P(\omega_i)$, the prior probability that the circuit is in
$\omega_i$, where $\omega_i$ denotes the condition of the circuit
having the intermittent fault $i$, is known. It is also assumed that
$e_i$, the probability that the effect of the intermittent
fault is present knowing that the circuit already has
the intermittent fault $i$, is constant and known. After
applying a test $t_j$ to the circuit and observing the output,
the posterior probabilities $P(\omega_i \mid \text{output when } t_j \text{ is applied})$
are calculated using Bayes' rule. These posterior probabilities
are used as prior probabilities the next time a test is
applied.
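
The prior-to-posterior update described above can be sketched as
follows. This is an illustrative sketch only: the likelihood used
here (a test that covers fault i fails with probability e_i when the
circuit is in condition omega_i, and never fails otherwise) is an
assumption made for the example, and the function and argument names
are hypothetical.

```python
def bayes_update(prior, activity, detects, failed):
    """One Bayes step for a zero-order intermittent fault model.

    prior:    dict mapping each circuit condition to its prior probability
    activity: dict mapping each fault condition to e_i (absent keys mean 0)
    detects:  set of conditions whose fault the applied test can detect
    failed:   True if the observed output was incorrect
    """
    unnormalized = {}
    for condition, p in prior.items():
        # Assumed likelihood: the test fails only if it covers the fault
        # and the fault's effect is present while the test is applied.
        p_fail = activity.get(condition, 0.0) if condition in detects else 0.0
        likelihood = p_fail if failed else 1.0 - p_fail
        unnormalized[condition] = p * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {c: v / total for c, v in unnormalized.items()}
```

Feeding the returned dictionary back in as `prior` for the next test
mirrors the reuse of posterior probabilities as priors.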





4.1 Combinational Circuits

A transient fault is intermittent if it occurs repeatedly.
If a transient fault is not intermittent, it would be very
difficult to test for it in a combinational network. This is
because the combinational network would behave as if it were
fault free after the transient has disappeared. If the transient
does not occur repeatedly we might never catch it at all.
Therefore, intermittent/transient faults will be considered
here. It is necessary to characterize intermittent faults.

4.1.1 Model of Intermittent Faults

Arrival: We will assume that intermittent faults arrive in
a random manner. The interarrival times of faults will be
assumed to be independent and random with a known probability
density function.

Duration: After a fault arrives, it persists for a certain
time which is its duration. We will assume that the intermittent
fault has a duration which is random with a known
probability density function. We will also assume that the
duration is independent of the arrival.

Depending on the nature of the fault, different density
functions might be used to model the random nature of the
interarrival times and the duration.

(i) If an assumption that short interarrival times (or durations)
are more likely than longer ones is used, we arrive at an
exponential or hyperexponential density as an approximation.

(ii) If an assumption that there is a definite mean interarrival
time (or duration) with an associated spread is
made, we would get a gamma, normal, Rayleigh, Erlang or Weibull
approximation. Naturally, the chosen density function should
have a value of zero for negative values of the argument
(time, in this case).

4.1.2 Fault Detection

[Timing diagram: a fault arrival at $t_f$ followed by a test (set) of
duration $d$ applied at $t_s$]

$t_f$: time fault arrives

$t_s$: time test (set) is applied

$d$: duration of the test (set)

It will be assumed that a fault arrival can be detected
by a test only if the effect of the fault arrival is present
for the complete duration of the test, since otherwise the
output will change while observing for the presence of the
fault, leading to uncertainty. When a test set is applied,
a fault arrival can be detected only if it persists for the
entire duration of the particular test(s) which test for it.
Since tests in a test set can be applied in any order, it is
conservative to assume that a fault arrival is not detected
by a test set if it does not persist for the entire duration
of the test set. Therefore, a fault arriving at $t_f$ can be
detected by a test set applied at $t_s$ only if the duration
of the fault, $t_d$, is such that

$$t_f + t_d \ge t_s + d, \qquad t_f \le t_s.$$

Due to the non-permanent nature of the faults, it is
necessary to apply the test set repeatedly. If the network
gives an incorrect output under any test set application, the
testing can be stopped because the presence of a fault is
indicated. But it is necessary to have a decision rule which
will permit us to stop further applications of the test set at
some stage when the network has responded correctly to all the past
applications of the test set. The decision rule which will be
used here is: the conditional probability of not detecting a
fault, given that the network has an intermittent fault, is close
to 0. That is,

$$P(\text{fault not detected} \mid \text{network is faulty}) \le |\epsilon|$$

Depending on the level of confidence required about the fault
free condition of the network (if it responds correctly to all
the test set applications), $|\epsilon|$ can be chosen to be as close
to 0 as needed.


4.1.3 Sequential Analysis

Let $T_1, T_2, \ldots, T_k$ be $k$ applications of the test set,
applied at times $t_{s_1} < t_{s_2} < \cdots < t_{s_k}$ respectively. All the
probabilities mentioned below are conditional probabilities
given that the network has an intermittent fault.

$P(T_j)$ = Probability that $T_j$ detects the presence of a fault
         = Probability that there is a fault arrival at $t_f$ before
           $t_{s_j}$ such that its duration is $t_d \ge (t_{s_j} - t_f) + d$

$P(\bar T_j)$ = Probability that $T_j$ does not detect the presence
           of a fault
         = $1 - P(T_j)$

$P(\bar T_i \cap \cdots \cap \bar T_1)$ = Probability that none of the $i$
           applications of the test set detect a fault.

All the fault arrivals which can be detected by $T_2$
can be classified into two groups.

i) Arrivals which can be detected by both $T_2$ and $T_1$

ii) Arrivals which can be detected by $T_2$ and not $T_1$.

Given that $T_1$ has not detected the presence of any
fault, the probability that $T_2$ detects a fault is just the
probability that there is an arrival belonging to group ii).
Hence, $P(T_2 \mid \bar T_1) = P(T_2) - P(T_2 * T_1)$, where $T_2 * T_1$ is the
event that there is a fault arrival which can be detected
by both $T_2$ and $T_1$.

$$P(\bar T_2 \mid \bar T_1) = 1 - P(T_2 \mid \bar T_1) = 1 - P(T_2) + P(T_2 * T_1)$$

$$P(\bar T_2 \cap \bar T_1) = P(\bar T_2 \mid \bar T_1)\,P(\bar T_1)
  = \bigl(1 - P(T_2) + P(T_2 * T_1)\bigr)\bigl(1 - P(T_1)\bigr)$$

We need to compute $P(\bar T_3 \cap \bar T_2 \cap \bar T_1)$. Given that
neither $T_2$ nor $T_1$ has detected the presence of a fault, the
probability that $T_3$ detects a fault is given by

$$P(T_3 \mid \bar T_2 \cap \bar T_1) = P(T_3) - P\bigl((T_3 * T_2) \cup (T_3 * T_1)\bigr)$$

But the event $T_3 * T_1$ is a subset of the event $T_3 * T_2$. Hence,

$$P(T_3 \mid \bar T_2 \cap \bar T_1) = P(T_3) - P(T_3 * T_2)$$

$$P(\bar T_3 \mid \bar T_2 \cap \bar T_1) = 1 - P(T_3) + P(T_3 * T_2)$$

Similarly, for any $i > 1$,

$$P(\bar T_i \mid \bar T_{i-1} \cap \cdots \cap \bar T_1) = 1 - P(T_i) + P(T_i * T_{i-1})$$

The decision rule employed for the termination of the test set
applications requires the computation of the probability
$P(\bar T_k \cap \cdots \cap \bar T_1)$, which can be done as follows:

$$P(\bar T_1) = 1 - P(T_1)$$

$$P(\bar T_j \cap \bar T_{j-1} \cap \cdots \cap \bar T_1)
  = \bigl(1 - P(T_j) + P(T_j * T_{j-1})\bigr)\,P(\bar T_{j-1} \cap \cdots \cap \bar T_1),
  \qquad k \ge j > 1$$

As can be seen, this requires the computation of $P(T_j)$ for
$1 \le j \le k$ and $P(T_j * T_{j-1})$ for $1 < j \le k$.
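
The recursion for the probability that no application detects a
fault can be sketched directly. The function below is an
illustration under the assumption that the per-application
probabilities are already available as plain lists; the names are
hypothetical.

```python
def no_detection_probabilities(p_detect, p_overlap):
    """Running probability that none of the first j applications detects a fault.

    p_detect[j]  : P(T_{j+1}), detection probability of application j+1
    p_overlap[j] : P(T_{j+2} * T_{j+1}), probability of an arrival
                   detectable by both application j+2 and application j+1
    Returns a list whose entry j is P(not T_{j+1} and ... and not T_1).
    """
    if not p_detect:
        raise ValueError("at least one application is required")
    if len(p_overlap) != len(p_detect) - 1:
        raise ValueError("need one overlap probability per consecutive pair")
    result = [1.0 - p_detect[0]]
    for j in range(1, len(p_detect)):
        # Conditional non-detection factor: 1 - P(T_j) + P(T_j * T_{j-1})
        factor = 1.0 - p_detect[j] + p_overlap[j - 1]
        result.append(factor * result[-1])
    return result
```

Testing can stop at the first index whose entry falls at or below
the chosen threshold in the decision rule.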



4.1.4 Queueing Theory Applications

The intermittent fault model assumed can be noted for its
similarity to the models used in queueing theory. Hence, the
probabilities which need to be computed can be obtained using
results from queueing theory.

The fault system $f_i$, where $f_i$ is the intermittent fault
(say on the $i$-th line in a network), has fault arrivals and
duration. Naturally, when a fault arrives, there cannot be
another new arrival until its duration ends. Therefore, at
any time, the arrival of a new fault depends on the previous
arrival. But once a fault arrives, its duration is independent
of previous arrivals and previous fault durations.

A queueing system is characterized by the following three
factors:

1. The customer arrivals

2. The service time of customers

3. The service system

The customer arrivals and service times are expressed as
statistical distributions. The service system can be described
by the number of servers in the system and the queue
discipline.

There is a one-to-one correspondence between the parameters
of the fault model assumed and those associated with
a queueing system. Thus, the arrival of faults corresponds
to the customer arrivals, the duration of faults to the service
time of customers, and the service system corresponding to the
fault model is a single server system with no waiting permitted.

One implicit assumption made is that the arrival of
faults begins before testing is started. However, the exact
time of the beginning of the arrival process cannot be known.
Hence, it will be assumed that the arrivals begin long before
testing is started. This will permit us to use the steady
state results of queueing theory (which are time independent
and simple) rather than the transient ones (which are time
dependent and hence complex).

The Erlang k-distribution will be used to approximate
the statistical nature of the interarrival time and the fault
duration. If k equals one, this reduces to the exponential
distribution. When the Erlang k-distribution is used, the mean
and standard deviation can be made equal to those of the practical
problem and yet some of the properties of the negative exponential
distribution are maintained. The arrival (or duration) is
divided into a fixed number of independent, identical
(hypothetical) "phases", each phase having a negative exponential
distribution. Each time the last of the k phases ends, an arrival
takes place or the duration ends, as the case may be. The Erlang
k-distribution is,

$$f(t) = \frac{\lambda^{k}\, t^{k-1}\, e^{-\lambda t}}{(k-1)!}$$

where $\lambda$ is the rate of each phase ($\mu$ in the case of the
duration).
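
As a small numerical illustration of the Erlang k-density, the sketch
below evaluates it for a given number of phases. It assumes the
per-phase rate parametrization (each of the k phases is exponential
with the same rate, so the mean is k divided by the rate); the
function name is hypothetical.

```python
import math


def erlang_pdf(t, k, rate):
    """Erlang-k density with k exponential phases, each of the given rate."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    if t < 0:
        # The density is zero for negative time, as the model requires.
        return 0.0
    return rate ** k * t ** (k - 1) * math.exp(-rate * t) / math.factorial(k - 1)
```

With k = 1 the expression collapses to the exponential density
rate * exp(-rate * t), which matches the reduction noted in the text.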
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With the above assumptions, some correspondence relations 
can be stated. Using these relations, probabilities required
for fault analysis can be obtained from the equivalent queueing
model, which is a single server system with no waiting allowed.

$f_i$ is the type of intermittent fault present, with arrivals
and duration.

(i) P(effect of a fault arrival is present at time t) $= P(f_i)$
    corresponds to P(a customer is in the service facility at
    time t).

(ii) P(effect of a fault arrival is present from time t to
    at least t+d) $= P(f_i \cap rf_i \ge d)$ corresponds to
    P(a customer is in the service facility at time t and will
    require at least d more units of service time).

(iii) P(effect of any fault arrival is not present at time t)
    $= P(\bar f_i)$ corresponds to P(there is no customer in the
    service facility at time t).

(iv) P(effect of any fault arrival is not present from time
    t to at least t+s) $= P(\bar f_i \cap r\bar f_i \ge s)$ corresponds
    to P(no customer is in the service facility at time t and no
    new customer will enter the service facility at least till
    time t+s).
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4.1.5 Single Fault Detection Procedure 

Let the set of faults in a circuit be $F = \{f_1, f_2, \ldots, f_n\}$,
where each $f_i$ is a single fault. A fault event $f$ occurs
when a single fault $f_i$ from $F$ occurs. When the assumption
that only a single fault can occur is made, the n faults in
$F$ are not independent occurrences. Therefore, the arrivals
and duration considered will be those of $f$.

We will assume that each test set application $T_j$ has a
duration $d$. It will also be assumed that the elapsed time
between successive test set applications $T_{j-1}$ and $T_j$ is
a constant, $L$. A test set application $T_j$ applied at $t_{s_j}$
will detect the presence of a fault only if the effect of a
fault arrival is present from $t_{s_j}$ to at least $t_{s_j} + d$.
Therefore,

$$P(T_j) = P(f \cap rf \ge d) \qquad \text{(i)}$$

A fault arrival which can be detected by $T_j$ can also be
detected by $T_{j-1}$ only if the arrival occurs before $t_{s_{j-1}}$ and
is such that the effect of the fault arrival is present from
$t_{s_{j-1}}$ to at least $t_{s_{j-1}} + d + L$. Therefore,

$$P(T_j * T_{j-1}) = P(f \cap rf \ge L + d) \qquad \text{(ii)}$$

The above probabilities are independent of $t_{s_j}$ because we
use steady state results.
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Hence, given k applications of the test set, the 
probability that none of them detects a fault is, 

$$P(\bar T_k \cap \cdots \cap \bar T_1) = \bigl[1 - P(f \cap rf \ge d)\bigr]
  \prod_{j=2}^{k} \bigl[1 - P(f \cap rf \ge d) + P(f \cap rf \ge L + d)\bigr]$$

Note that the above probability is the conditional probability
given that the network has an intermittent fault. This is
because the model we have assumed guarantees at least one
fault arrival.

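Now we consider 4 cases of arrival and duration.

(i) Exponential interarrival time: $f(t_a) = \lambda e^{-\lambda t_a}$

    Exponential duration: $f(t_d) = \mu e^{-\mu t_d}$

$$P(f) = \frac{\lambda}{\lambda + \mu}$$

$$P(f \cap rf \ge d) = e^{-\mu d}\,\frac{\lambda}{\lambda + \mu}$$

$$P(\bar f \cap r\bar f \ge s) = e^{-\lambda s}\,\frac{\mu}{\lambda + \mu}$$

Hence,

$$P(\bar T_k \cap \cdots \cap \bar T_1) =
  \Bigl[1 - \frac{\lambda}{\lambda+\mu}\,e^{-\mu d}\Bigr]
  \Bigl[1 - \frac{\lambda}{\lambda+\mu}\,e^{-\mu d}
        + \frac{\lambda}{\lambda+\mu}\,e^{-\mu (L+d)}\Bigr]^{k-1}$$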
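
For the exponential/exponential case, the steady state of a single
server system with no waiting gives P(f and rf >= x) =
lam / (lam + mu) * exp(-mu * x), so the non-detection probability
and the stopping point of the decision rule can be computed
directly. The following is a sketch under that assumption; the
function names, the parameter name `interval` (standing for L) and
the `max_k` safeguard are illustrative and not part of the original
analysis.

```python
import math


def no_detection_probability(lam, mu, d, interval, k):
    """P(no detection in k applications), exponential arrivals and durations.

    lam: fault arrival rate, mu: fault departure rate,
    d: test set duration, interval: time L between successive applications.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    p_present = lam / (lam + mu)
    p_t = p_present * math.exp(-mu * d)                  # P(T_j)
    p_tt = p_present * math.exp(-mu * (interval + d))    # P(T_j * T_{j-1})
    return (1.0 - p_t) * (1.0 - p_t + p_tt) ** (k - 1)


def applications_needed(lam, mu, d, interval, eps, max_k=1_000_000):
    """Smallest k whose non-detection probability is at most eps."""
    for k in range(1, max_k + 1):
        if no_detection_probability(lam, mu, d, interval, k) <= eps:
            return k
    raise ValueError("threshold not reached within max_k applications")
```

When the interval between applications is long compared with the
mean fault duration, the overlap term vanishes and the applications
behave almost independently.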


^ t 

(ii) Exponential interarrival time:-- f(t ) = A.e ^ 

^ li“ t§“l 


k-Erlang duration (k=m) f 


(m-1) I 


in 


P(f) = I P(n stages of duration remaining) 
n=l 


l+li 


PCf) 


li 


X+y 


P(fnrf>d) = 


m , 


2 P(n stages of duration remaining 
n=l and remaining duration is ^d) 


m , - dy'm-tl ‘ „ 

= 1 =4x+rrr e I (»d)^ 

n=l S^O Si 


q 

E(fnrf^S) = e^ y 

X+y 


Hence , 



P (Tj^ ^^k— 1 * * " 





-*• 

- 





m 

1- I 

n=l 

i (Kd)® 

m(Xty) g£(j Si 

X 

1- 

Y i 

nSl 

-va 

- 

- 


- 


,1 




m-1 

slol 

(yd)® {yd+yL)®-yLl 




Si 

Si ) 


* J 

(iii) k-Erlang interarrival time (k=Jl) r f(t_)"^^: — _ S_ — 

^ (£’l) I 

Exponential duration: ^ ~ \ 
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P{f) - 


a 

2 P {n stages of arrival remaining and fault in 
n=l 1 


in duration) = — 


X _ (l+-tL) 


P(f) = 


I--' + ^ 

y y 


(*)“ 


P(fnrf>^d) = X fl“ ^ 

y 


P(fnrf^S) = ^ P [n stages of arrival remaining with no fault 

n=l 

in duratioh and remaining stages need more 
than S units of time) 


K H“i) 


,U"(^+1) A"1 Trt „ 

' I 5^® qs)'^ 


<i=-o 


qi 


Hence , 

P(Tj^n...T^) = 




(iv) k~Erlang interarrival time {k=£) , §■ 


.Z.5L-1 -Xt* 
X t_ e 


{Z-D I 


k-Erlang duration (k-m) , f (t^) ~ ^ e^^*^ 

(m-1) I 


First, it is necessary to solve for the following mi+i 
simultaneous linear algebraic equations. 

XP(r,-;0) = XP(r+l,-,0) + iJP(r,l;l) 




APCa,-;0) - 11P(£,1?1) 2r-l,2, 

(X+y)P (r,n;l) = AP(r+l,n;l) + iiP(r,n+l;l) 

r=l,2, * , . ,£.-1 
3^“1 ^ 2 j » /in,“l 

(A+ii)P(r,m;l) = AP(r+l,m;l) 

r=l,2, ... ,£.“1 

(X+ii)P (£.,n?l) = jiP (Jl,ii+l;l) 

H"1 f 2 f * m m f m— 1 

(A+ii)P(£,,m,l) = XP(X,~;0) / 

1ml 

I I I P(3:,n,j) =1 , where 

j=0 n-1 r=l 

P(i,j,l) ^ Probability, that there are i stages of arrival 
remaining and j stages of duration remaining and a 
fault in duration. 

*P(i,-,0) ^ Probability that there are i stages of arrival 
remaining and no fault in duration. 


The required probabilities can be obtained as follows, 
m £. 

P(f) = I P£q,n;l) 

n=l q=l 


P(fnrf>d) = 


m‘ 

I I P(q,n?l) 
q=l n=l 


-]id m-1 (Ud)^ 
® I SI 

s=o 







P(f) = I P(q,-;0) = l-P(f) 
q=l 


P (f nrf >^S) 


A 

I P(q,-;0) 
q=l 


5.-1 



I 

c=0 


qs)° 

cl 


Hence, 

= 


£ m 

I P(q,n?l) 
q=l n=l 


“pd m -1 

® 1 

S=0 


SI / 






m 

^ P ( 5 , n 7 1 ) 
n=l 




(pd-hjtL) 

SI 



k-1 


3 


The probability that a fault is detected is

$$P_k = 1 - P(\bar T_k \cap \cdots \cap \bar T_1)$$

In practice, a fault can be detected if it exists for the
entire duration of the particular test(s) which test for it.
Since we required the fault arrival to affect the entire
duration of the test set, the actual probability of fault
detection will be greater than the $P_k$ obtained above. But
the above procedure can be easily extended to give more
accurate results.

Let the test set contain g tests. Each test has a
duration of d/g, where d is the test set duration. Let
the set of possible faults be $F = \{f_1, f_2, \ldots, f_n\}$. Let
$P_{f_i}$ be the conditional probability that fault $f_i$ has occurred
given that $f$ has occurred. Naturally,

$$\sum_{i=1}^{n} P_{f_i} = 1$$

It is necessary that $P_{f_i}$ be known for each $i$. By considering
those tests which test for $f_i$, the probability $P(i)$
that $f_i$ is detected can be computed using the results developed
above by substituting tests in place of test set everywhere.
Then, the probability of detection is

$$\sum_{i=1}^{n} P_{f_i}\, P(i)$$

This is the actual probability of detection of a fault.


4.1.6 Multiple Fault Detection Procedure

Let the number of lines which can be faulty be n. All these
lines are assumed to have identical fault characteristics. It is
also assumed that the probability that b lines are faulty is the
same as the probability that a single line is faulty. Due to the
large density of circuits in present day ICs, this seems a
reasonable assumption. We also assume that an intermittent fault
on a line is either of the s-a-1 or s-a-0 type but not both. The
arrival and duration of the intermittent faults corresponding to
each of the n lines are similar to the corresponding ones of $f$
as used in the single fault case.

Here, it will be assumed that a multiple fault can be detected
only if it is present for the entire duration of the test set
(MFDTS). This is a reasonable assumption because, if any component
fault in the multiple fault is present only for part of the duration
of the test set, neither of the two different fault situations
would be detected if the particular tests which test for them are
not applied during their presence. All the assumptions regarding
the test set applications will be the same as before. The
probability that a given line in a network is faulty - 
The probability that the given network is faulty is, 

<b> 

The probability that exactly b of the n wires are faulty is. 
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Given that a b-wire multiple fault has occurred, each of the 
component wires of the multiple fault has fault arrivals with
duration. Let the faults corresponding to the b wires be labeled
$f_1, f_2, \ldots, f_b$. A multiple fault involving $f_1$ can be
detected by a test set application if the effect of a fault arrival
of $f_1$ exists for the complete duration of the test set, d, and
none of the other faults have an arrival whose effect exists for
only part of the duration d.

P(multiple fault involving $f_1$ is detected)

$$= P(f \cap rf \ge d)\,\bigl(P(f \cap rf \ge d) + P(\bar f \cap r\bar f \ge d)\bigr)^{b-1}$$

P(multiple fault involving $f_2$ but not $f_1$ is detected)

= P($f_1$ has no fault arrival with an effect on the test set and
$f_2$ has an arrival with effect for the complete duration d and none
of the other faults have an arrival with an effect for only part
of the duration d)

$$= P(\bar f \cap r\bar f \ge d)\,P(f \cap rf \ge d)\,
   \bigl(P(f \cap rf \ge d) + P(\bar f \cap r\bar f \ge d)\bigr)^{b-2}$$

The probability that the b-wire multiple fault is detected by a
test is,

$$S_b(d) = P(f \cap rf \ge d) \sum_{i=1}^{b}
   \bigl(P(\bar f \cap r\bar f \ge d)\bigr)^{i-1}
   \bigl(P(f \cap rf \ge d) + P(\bar f \cap r\bar f \ge d)\bigr)^{b-i}$$

Therefore, the conditional probability that a test set application
will detect a multiple fault given that it has occurred is,

$$P(T_j) = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}\, S_b(d)
  = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}\, P(f \cap rf \ge d)
    \sum_{i=1}^{b} \bigl(P(\bar f \cap r\bar f \ge d)\bigr)^{i-1}
    \bigl(P(f \cap rf \ge d) + P(\bar f \cap r\bar f \ge d)\bigr)^{b-i}$$
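
The double sum for the multiple fault detection probability is easy
to evaluate numerically once the two per-line probabilities are
known. The sketch below assumes they are passed in as plain numbers:
`p_present` stands for P(f and rf >= d) and `p_absent` for
P(not f and no arrival for at least d); both names are hypothetical.

```python
import math


def s_b(b, p_present, p_absent):
    """S_b(d): probability that a b-wire multiple fault is detected."""
    total = sum(
        p_absent ** (i - 1) * (p_present + p_absent) ** (b - i)
        for i in range(1, b + 1)
    )
    return p_present * total


def multiple_fault_detection_probability(n, p_present, p_absent):
    """P(T_j): weight S_b(d) by C(n, b) / (2**n - 1) over b = 1..n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    denominator = 2 ** n - 1
    return sum(
        math.comb(n, b) / denominator * s_b(b, p_present, p_absent)
        for b in range(1, n + 1)
    )
```

For n = 1 the expression reduces to the single fault value
P(f and rf >= d), which is a convenient sanity check.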


A multiple fault arrival which can be detected by $T_{j-1}$ can also be
detected by $T_j$ only if all the component fault arrivals occur
before $t_{s_{j-1}}$ and are such that their effects are present from
$t_{s_{j-1}}$ to at least $t_{s_{j-1}} + d + L$. This probability is
$S_b(T+d)$, where we substitute $T+d$ for $d$ and $T$ denotes the
interval $L$ between applications. Hence,

$$P(T_j * T_{j-1}) = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}\, S_b(T+d)
  = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}\, P(f \cap rf \ge T+d)
    \sum_{i=1}^{b} \bigl(P(\bar f \cap r\bar f \ge T+d)\bigr)^{i-1}
    \bigl(P(f \cap rf \ge T+d) + P(\bar f \cap r\bar f \ge T+d)\bigr)^{b-i}$$


Hence, given k applications of the test set, the probability that
none of them detects a fault is,

$$P(\bar T_k \cap \cdots \cap \bar T_1) =
  \Bigl(1 - \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}\, S_b(d)\Bigr)
  \Bigl(1 - \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^{n}-1}
        \bigl(S_b(d) - S_b(T+d)\bigr)\Bigr)^{k-1}$$

where, for $x = d$ or $x = T+d$,

$$S_b(x) = P(f \cap rf \ge x) \sum_{i=1}^{b}
   \bigl(P(\bar f \cap r\bar f \ge x)\bigr)^{i-1}
   \bigl(P(f \cap rf \ge x) + P(\bar f \cap r\bar f \ge x)\bigr)^{b-i}$$


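Now we consider one of the 4 cases mentioned before.

Exponential interarrival time: $f(t_a) = \lambda e^{-\lambda t_a}$
Exponential duration: $f(t_d) = \mu e^{-\mu t_d}$

In this case the probability that none of the k applications
detects a fault is the expression above with, for $x = d$ and
$x = T+d$,

$$S_b(x) = \frac{\lambda}{\lambda+\mu}\,e^{-\mu x} \sum_{i=1}^{b}
   \Bigl(\frac{\mu}{\lambda+\mu}\,e^{-\lambda x}\Bigr)^{i-1}
   \Bigl(\frac{\lambda}{\lambda+\mu}\,e^{-\mu x}
        + \frac{\mu}{\lambda+\mu}\,e^{-\lambda x}\Bigr)^{b-i}$$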

4.1.7 Conclusion

As can be seen, the expression becomes complex. However, if the
actual values of the various parameters are substituted, the
computation can be performed in a systematic manner on
a computer. The various parameters can be estimated using methods
employed in queueing systems.

We have obtained expressions to determine the number of
times a combinational circuit has to be tested when checking
for the presence of single or multiple faults of the
intermittent/transient type. Though the results depend on the
model used, the model is quite general, so these results should
be useful.



4.2 Sequential Machines


A sequential machine can be tested by applying an appro- 
priate sequence of input signals, termed a checking sequence, 
to the circuit and observing the output sequence that the 
circuit produces in response [3-3] . The checking sequence 
determines whether or not the sequential machine is operating 
in accordance with the given state table description rather 
than testing for specific hardware failures. Hence it is 
difficult to find the precise relationship between the presence 
of an I/T fault and its effect on the output sequence. Though 
the output of the sequential machine may be correct during the 
presence of the fault, it could be incorrect at a later stage. 
Therefore, it is convenient to model the faulty sequential 
machine as a probabilistic sequential machine. 

4.2.1 The Model

If the statistics of the I/T fault are known, the probability
that the effect of the I/T fault is present at a given time
can be calculated. A particular fault will affect
the next state and output functions in a particular way. By
knowing the exact way in which each fault will affect the next
state and output functions, along with the relative probabilities
of occurrence of these faults and their statistics, the
faulty sequential machine can be modeled as a probabilistic
sequential machine. Instead of the exact model, it is possible
to set up an approximate model, with relative ease, by assuming
that every possible combination of incorrect next state
and/or output is equally likely when the effect of the I/T
fault is present. In either case, we arrive at a probabilistic
sequential machine model of the faulty machine.

4.2.2 Testing

The actual application of the checking sequence is preceded
by the application of a homing sequence to bring the
machine to a fixed starting state. Initially, if we assume
that the machine is equally likely to be in any of its states,
the probability of the machine being in any final state after
the application of the homing sequence can be computed using
the transition probabilities and the output response of the
machine to the homing sequence.

If the initial state probabilities are known, the probability
that the machine's output response to the checking
sequence is correct can be easily found [3-4]. This represents
the probability that the test does not detect the fault, given
that the machine is faulty. Therefore, the probability that
n applications of the test fail to detect the fault, given
that the machine is faulty, can be calculated.
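
One common way to obtain the probability that a probabilistic
sequential machine produces a given output sequence is a forward
pass over the state distribution. The sketch below illustrates that
general technique and is not necessarily the method of [3-4]; the
data layout (`trans[(x, y)][s][s2]` as the probability of moving
from state s to state s2 while emitting output y on input x) and
the function name are assumptions made for the example.

```python
def output_sequence_probability(initial, trans, inputs, outputs):
    """Probability that the machine emits `outputs` in response to `inputs`.

    initial: list of initial state probabilities
    trans:   dict keyed by (input, output); each value is a square matrix
             (list of rows) with trans[(x, y)][s][s2] =
             P(next state s2 and output y | current state s, input x)
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    alpha = list(initial)
    for x, y in zip(inputs, outputs):
        matrix = trans[(x, y)]
        # Propagate the joint probability of the outputs seen so far
        # and the current state.
        alpha = [
            sum(alpha[s] * matrix[s][s2] for s in range(len(alpha)))
            for s2 in range(len(matrix[0]))
        ]
    return sum(alpha)
```

If `outputs` is the fault-free response to the checking sequence,
the returned value is the probability that one application of the
test fails to detect the fault.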

4.2.3 Conclusion

The testing of sequential machines for I/T faults is
straightforward once an exact model of the faulty machine as
a probabilistic sequential machine is obtained, due to the
results already available in this area. Hence, in this section,
we have just outlined the technique.
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4.3 Self diagnosable systems ; 

Self diagnosing capability is becoming an important requirement
of systems as their complexity increases and greater emphasis is
being placed on their reliability. Design conditions for such
systems, where the units are capable of testing each other, have been
studied for permanent failures of units. One such system is the
t-fault diagnosable system proposed by Preparata et al. We shall
study the capability of t-fault diagnosable systems to diagnose
intermittent/transient (I/T) faults in units.

4.3.1 Preliminaries

We shall assume that a fault free unit correctly evaluates the
tested unit as being faulty or fault free, while a faulty unit's
evaluation of the tested unit could be incorrect. Under such
circumstances, the diagnosis of the faulty units is achieved through
the results of the tests performed by the fault free units.

When a unit has an I/T fault, it may have to be tested several
times by a fault free unit before a correct evaluation can be performed.
Therefore, we will assume that after every test routine, an updated
syndrome (set of test outcomes) is formed which describes the
evaluation of all the units to date. Any time the updated syndrome
corresponds to a consistent set of faults, diagnosis can be performed.

Because of the time delay from the initiation of the I/T fault in a
unit to its detection, it is likely that additional units could have
faults initiated in them in the meantime. Therefore, even if
certain units are diagnosed as being faulty, one cannot be absolutely
certain that no more units are faulty. Hence, incomplete diagnosis
is inevitable. We shall designate the I/T fault capability of a
system, $t'$, as the maximum number of units which could be faulty
such that the diagnosis is at worst incomplete but never incorrect,
i.e., a fault free unit is never diagnosed as being faulty.

4 , 3.2 The Two Partitions 


Pig. 4 . 3-1 

S^, ^ sets of units, n S2 - 0 and ~ 

R being the remaining units in the system. Because the system is 
t-fault diagnosable, it is not possible that neither nor S2 
receives any testing links from R. Therefore, there are only 2 
possibilities; 

i) Only one of $S_1$, $S_2$ receives links from R. In such a case, there
is a non-zero probability of diagnosing $S_1$ ($S_2$) as being faulty when
in fact $S_2$ ($S_1$) is faulty, if $S_1$ ($S_2$) receives no testing links
from R. In Fig. 4.3-1, if $S_2$ is the faulty set of units, it is possible
to obtain an updated syndrome where, due to insufficient testing, all
links from R to $S_2$ are 0-links. If, in addition, all links from $S_2$
to $S_1$ are 1-links and all links within $S_2$ are 0-links, regardless
of the nature of the links from $S_1$ to $S_2$, this syndrome would be a
valid syndrome and would correspond to a fault pattern where $S_1$ is
the faulty set of units. So we could diagnose $S_1$ as being the
faulty units when in fact $S_2$ is the set of faulty units.

ii) Both $S_1$ and $S_2$ receive links from R. In such a case, there is
a zero probability of diagnosing $S_1$ ($S_2$) as being faulty when in fact
$S_2$ ($S_1$) is faulty. This is because a valid syndrome corresponding to
a fault pattern where $S_1$ ($S_2$) is the faulty set of units would
require all the links from R to $S_1$ ($S_2$) to be 1-links, and this
could never happen if $S_2$ ($S_1$) is indeed the faulty set of units.

[Fig. 4.3-2]
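
Case i) can be checked mechanically on a toy system. The sketch below is an illustration only: the four-unit system, the link list, and the function name `consistent` are hypothetical, and the consistency rule assumed is the usual one for permanent faults (a fault-free tester reports 1 exactly when the tested unit is faulty, while a faulty tester may report anything).

```python
def consistent(syndrome, faulty):
    """True if every test made by a unit outside `faulty` reports 1
    exactly when the tested unit is in `faulty`."""
    return all(
        outcome == (tested in faulty)
        for (tester, tested), outcome in syndrome.items()
        if tester not in faulty  # outcomes of faulty testers are unconstrained
    )

# Hypothetical toy system: S1 = {0}, S2 = {1}, R = {2, 3}.
# S2 receives a link from R (unit 2 tests unit 1); S1 receives none.
S1, S2 = {0}, {1}
syndrome = {
    (2, 1): 0,  # R -> S2: the I/T fault in unit 1 was inactive during the test
    (1, 0): 1,  # S2 -> S1: the faulty tester happens to report a failure
    (0, 1): 1,  # S1 -> S2: irrelevant once S1 is assumed faulty
    (2, 3): 0,  # links within R
    (3, 2): 0,
}

print(consistent(syndrome, S1))  # True: the syndrome points at S1
print(consistent(syndrome, S2))  # False: the real fault set is ruled out
```

Although $S_2$ is the set that is actually faulty here, the only fault pattern of the two that agrees with the syndrome is $S_1$, which is the incorrect diagnosis described in case i).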

4.3.3 I/T Fault Diagnosability

We can describe the I/T fault diagnosing capability of a
t-fault diagnosable system by an index $t'$. If the number of faulty
units does not exceed $t'$, there is a zero probability of diagnosing
a fault-free unit as faulty. This requires that given any 2 sets
of units $S_1$ and $S_2$ with $|S_1|, |S_2| \le t'$ and
$S_1 \cap S_2 = \emptyset$, both $S_1$ and $S_2$ receive at least one
link from R. This guarantees that even if $S_1$, $S_2$ are sets of units
such that $|S_1|, |S_2| \le t'$ and $S_1 \cap S_2 \ne \emptyset$,
$S_1 \cup S_2 \ne S_1$, $S_1 \cup S_2 \ne S_2$, there is a zero
probability of diagnosing a fault-free unit as faulty. The reasoning
is as follows.

Because $S_1 - (S_1 \cap S_2)$ and $S_2 - (S_1 \cap S_2)$ are disjoint
sets of units with cardinality $\le t'$, each of them receives at least
one link from R. Therefore, if $S_1$ ($S_2$) is the faulty set of units,
there is a zero probability of diagnosing $S_2$ ($S_1$) as being faulty,
because the links from R to $S_2 - (S_1 \cap S_2)$
($S_1 - (S_1 \cap S_2)$) would never be 1-links.

Therefore, when a set of units is diagnosed as faulty, there is
a 100% probability that all those units are faulty, if the number
of faulty units is $\le t'$. Hence the diagnosis is incomplete at
worst but never incorrect as far as the faulty units are concerned.
If a set of units $S_1$ is faulty, it is always possible to diagnose
only a proper subset of $S_1$ as being faulty.

[Fig. 4.3-3]

In Fig. 4.3-3, $S_{11}$ and $S_{12}$ are proper subsets of $S_1$ such that
$S_{11} \cup S_{12} = S_1$. When $S_1$ is faulty, due to insufficient
testing, it is possible that all the links from R to $S_{12}$ are
0-links, while all those from R to $S_{11}$ are 1-links. If, in addition,
all the links from $S_{12}$ to $S_{11}$ are 1-links and all links within
$S_{12}$ are 0-links, the diagnosis will designate $S_{11}$ as the faulty
units. As long as the number of allowable faulty units is greater
than 1, this sort of incomplete diagnosis has a non-zero probability.


There are several partitions of a system into $S_1$, $S_2$ and R
(as in Fig. 4.3-1) such that $S_1$ receives no testing links from R.
$t'$ has to be less than $\max(|S_1|, |S_2|)$ of each such partition:

    $$t' = \min_{\text{all partitions}} \bigl(\max(|S_1|, |S_2|)\bigr) - 1$$
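
For small systems the minimum over partitions can be evaluated by exhaustive search. The following is a minimal sketch and not the procedure developed later in this section. It assumes the system is given as a list `tests`, where `tests[i]` is the set of units tested by unit i (units numbered from 0), and it uses the observation that for a fixed $S_1$ the smallest admissible $S_2$ is the set of units outside $S_1$ that test some unit of $S_1$. The function name and the 5-unit ring used as input are hypothetical.

```python
from itertools import combinations

def it_fault_capability(tests):
    """Brute-force t' = min over partitions of max(|S1|, |S2|), minus 1."""
    n = len(tests)
    # testers[i] = set of units that test unit i
    testers = [{j for j in range(n) if i in tests[j]} for i in range(n)]
    best = n  # smallest max(|S1|, |S2|) found so far
    for size in range(1, n + 1):
        if size >= best:
            break  # a larger S1 cannot give a smaller maximum
        for s1 in map(set, combinations(range(n), size)):
            # Every tester of S1 outside S1 must lie in S2,
            # otherwise S1 would receive a link from R.
            s2 = set().union(*(testers[i] for i in s1)) - s1
            best = min(best, max(len(s1), len(s2)))
    return best - 1

# Hypothetical 5-unit ring in which unit i tests units i+1 and i+2 (mod 5).
ring = [{(i + 1) % 5, (i + 2) % 5} for i in range(5)]
print(it_fault_capability(ring))  # 1
```

The search is exponential in the number of units, which is why the bounds and pruning rules in the following sections matter for systems of realistic size.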




We shall now find bounds for $t'$. Let us define $\lfloor x \rfloor$ as
the largest integer smaller than $x$ (so that
$\lfloor x \rfloor = x - 1$ when $x$ is itself an integer).


4.3.4 Bounds for Asymmetric Testing


Lemma 1: In any t-fault diagnosable system where no two units
test each other, the minimum value of $t'$ is
$\lfloor (2t+1)/3 \rfloor$.

Proof: Let k be the number of units in $S_1$. The maximum number of
links possible within $S_1$ is $k(k-1)/2$. This would require the
smallest number of links incident on $S_1$ from outside $S_1$. Since
each unit has to be tested by at least t others, the smallest
number of links incident on $S_1$ from outside is $kt - k(k-1)/2$. If
m is the cardinality of $S_2$, the smallest m will be needed when
each unit in $S_2$ tests each unit in $S_1$. Therefore, the smallest
size of $S_2$ is given by

    $$km = kt - \frac{(k-1)k}{2}$$

Since m has to be an integer, in any t-fault diagnosable system,
if $S_1$ has size k, the minimum size of $S_2$ is the smallest integer
not less than $t - (k-1)/2$. It is possible to design a system
which has a partition as in Figure 4.3-1, and these values of
$|S_1|$ and $|S_2|$. As the value of k increases, the minimum size
of $S_2$ decreases. The minimum value of $\max(|S_1|, |S_2|)$ occurs when
m = k, as can be seen from Fig. 4.3-4.

[Fig. 4.3-4]

    $$m = k = t - \frac{k-1}{2}$$

    $$m = k = \frac{2t+1}{3}$$


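If $(2t+1)/3$ is an integer, $\max(|S_1|, |S_2|) = (2t+1)/3$, since
m = k can then be attained exactly. When $(2t+1)/3$ is not an integer,
m = k cannot be attained, and the smallest value is the next integer
above $(2t+1)/3$. In both cases, with $\lfloor x \rfloor$ the largest
integer smaller than $x$,

    $$\text{minimum value of } \max(|S_1|, |S_2|) = \lfloor (2t+1)/3 \rfloor + 1$$

Since $t' = \min(\max(|S_1|, |S_2|)) - 1$ over all partitions,

    $$t'_{min} = \lfloor (2t+1)/3 \rfloor + 1 - 1 = \lfloor (2t+1)/3 \rfloor$$

Q.E.D.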
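
The arithmetic of Lemma 1 can be checked numerically. The sketch below is illustrative: the function names are hypothetical, and $\lfloor x \rfloor$ is read literally as the largest integer strictly smaller than $x$, which is an assumption about the intended notation.

```python
def min_s2(t, k):
    """Smallest integer m with m >= t - (k - 1)/2, and never negative."""
    # ceil(a / 2) for an integer a is (a + 1) // 2, here with a = 2t - k + 1
    return max(0, (2 * t - k + 2) // 2)

def min_of_max(t):
    """Minimum over k of max(|S1|, |S2|) with |S1| = k and |S2| = min_s2(t, k)."""
    return min(max(k, min_s2(t, k)) for k in range(1, 2 * t + 2))

def strict_floor(num, den):
    """Largest integer strictly smaller than num/den (num, den positive integers)."""
    return (num - 1) // den

for t in range(1, 8):
    print(t, min_of_max(t) - 1, strict_floor(2 * t + 1, 3))
```

For every t tried, the second and third columns agree. For example, t = 6 gives a minimum $t'$ of 4, and t = 4 gives 2.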


Lemma 2: The maximum value of $t'$ is t-1.

Proof: Since the system we consider is t-fault diagnosable, there
exists a fault pattern comprising (t+1) units which cannot be
diagnosed. Therefore, $t'$ cannot be greater than t.

Since the system is t-fault diagnosable, there is at least one
unit i which is tested by exactly t units. Hence, the system has
a partition as in Fig. 4.3-1, with $S_1$ consisting of i and $S_2$ of the
t units which test i. Therefore, in any system
$\min(\max(|S_1|, |S_2|))$ is at least as small as t. Therefore $t'$
can never exceed (t-1).

We shall show that the maximum value of $t'$ is (t-1) by citing
a connection assignment where it is so. Let us consider the $D_{1,t}$
connection. Let us try to form a partition as in Fig. 4.3-1, by
starting with a unit j in R.

Now $S_1$ can only be a subset of $M_j$, where $M_j$ is the set of units
not tested by j. Because of the $D_{1,t}$ connection, any $S_1$ will
have one unit which is not tested by any of the remaining units
in $S_1$. Therefore, $S_2$ has to have a size of at least t. Therefore
$t'$ has a value of t-1 in a system with $D_{1,t}$ connection.

Q.E.D.

4.3.5 Non-Asymmetric Testing 

Now, we shall consider a system which contains pairs of units
that test each other. The necessary and sufficient conditions for
such a system to be t-fault diagnosable were formulated by Hakimi
and Amin. We shall find bounds for $t'$ for such systems.

Lemma 3: In any t-fault diagnosable system where some pairs of
units test each other, $t'$ cannot be less than
$\lfloor (2t+1)/3 \rfloor$ and can be, at most, equal to t.

Proof: (I) One of the conditions necessary for a system to be
t-fault diagnosable is that for each integer p with $0 \le p < t$, given
any set of units R with $|R| = n - 2t + p$, the largest set of additional
units $S_2$, with every unit in $S_2$ being tested by at least one unit
in R, must be such that $|S_2| > p$.

For any given value of p and a partition as in Fig. 4.3-1,
since $|S_2|$ decreases as $|S_1|$ increases, the smallest value of
$\max(|S_1|, |S_2|)$ occurs when $|S_1| = |S_2|$. Let $|S_2| = p + k$ where
k > 0. Of all possible values of k, for a given p, there exists
only one, $k^*$, for which $|S_1| = |S_2|$.

    $$|S_1| = n - |R| - |S_2| = n - (n - 2t + p) - (p + k^*) = p + k^*$$

    $$3(p + k^*) = 2t + k^*$$

    $$p + k^* = \frac{2t + k^*}{3}$$

The minimum value of $k^*$ is 1 and there exists a p for which this
equality holds. Therefore, $t'$ cannot be less than
$\lfloor (2t+1)/3 \rfloor$.

(II) $t'$ obviously cannot be greater than t. Consider a system with
$2t+1 \le n < 2t+3$. Construct a $D_{1,t}$ system with a bidirectional
link between units i and j if $|i - j| = 1 \bmod n$. In this case each
unit is tested by (t+1) other units. If a unit k is in R, $S_1$ can only
be a subset of the set of units not tested by k. Any such $S_1$ will have
a unit m which is tested by at most one other unit in $S_1$ and a unit n
which is tested by at least one unit p not in $S_1$ such that p does not
test m. Therefore, every $S_2$ has a size of at least t+1 and
hence for such a system $t' = t$. Hence, the maximum value of $t'$ is t.

Q.E.D.
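
The necessary condition quoted in part (I) of the proof can be tested exhaustively on a small connection assignment. The sketch below checks only that one condition plus the requirement $n \ge 2t+1$, which is assumed here as a precondition, so it is not a complete test of t-fault diagnosability. The function name, the `tests` representation (`tests[i]` is the set of units tested by unit i) and the example systems are hypothetical.

```python
from itertools import combinations

def satisfies_condition(tests, t):
    """For each p in 0..t-1 and each R with |R| = n - 2t + p, require that
    more than p units outside R are tested by at least one unit in R."""
    n = len(tests)
    if n < 2 * t + 1:
        return False  # assumed precondition, keeps |R| positive
    for p in range(t):
        for r in map(set, combinations(range(n), n - 2 * t + p)):
            tested_outside = set().union(*(tests[i] for i in r)) - r
            if len(tested_outside) <= p:
                return False
    return True

# Hypothetical 5-unit ring: unit i tests units i+1 and i+2 (mod 5), t = 2.
ring = [{(i + 1) % 5, (i + 2) % 5} for i in range(5)]
print(satisfies_condition(ring, 2))                          # True
print(satisfies_condition([set() for _ in range(5)], 2))     # False: no tests at all
```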

We have established bounds for $t'$. For any given connection
assignment, the exact value of $t'$ can be determined by examining
all partitions as in Fig. 4.3-1. We shall now give an algorithm
to do this.

4.3.6 Procedure

Let us denote by $t'_m$ the smallest upper bound on the value of
$t'$, based on all the available information at any stage. We know
the initial value of $t'_m$. Then, we examine all Type A partitions,
i.e., partitions as in Fig. 4.3-1 with $|S_1| \le t'_m$ and
$|S_2| \le t'_m$, which contain a unit i in R. If there exists such a
partition, we update $t'_m$ and repeat. If there exists no such
partition we examine Type A partitions which contain a unit j but not
i in R, and so on till all possible partitions have been examined.
After all the partitions have been examined, the current value of
$t'_m$ is the value of $t'$.

Let us define

    $\Gamma_i$ = set of units tested by unit i
    $\Gamma_i^{-1}$ = set of units testing unit i

    $$P_{ij} = |\Gamma_i \cup \Gamma_j \cup i \cup j|$$

    $$Q_{ij} = |\Gamma_i^{-1} \cup \Gamma_j^{-1} \cup i \cup j|$$
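
These two quantities are straightforward to compute from a connection assignment. The sketch below is illustrative only: `make_p_q`, the `tests` representation (`tests[i]` is the set of units tested by unit i, numbered from 0) and the 5-unit ring are hypothetical names and data.

```python
def make_p_q(tests):
    """Return functions P(i, j) and Q(i, j) for the given connection assignment."""
    n = len(tests)
    # testers[i] plays the role of the inverse set: units that test unit i
    testers = [{j for j in range(n) if i in tests[j]} for i in range(n)]

    def P(i, j):
        return len(tests[i] | tests[j] | {i, j})

    def Q(i, j):
        return len(testers[i] | testers[j] | {i, j})

    return P, Q

# Hypothetical 5-unit ring: unit i tests units i+1 and i+2 (mod 5).
ring = [{(i + 1) % 5, (i + 2) % 5} for i in range(5)]
P, Q = make_p_q(ring)
print(P(0, 1), Q(0, 1), P(0, 2))  # 4 4 5
```

A small $P_{ij}$ means that i and j together exclude few units from $S_1$ when both are in R, and a small $Q_{ij}$ means that i and j together force few units into $S_1 \cup S_2$ when both are in $S_1$.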


Procedure 1: This will be used to find an upper bound, sufficiently
lower than $t'_m$, on the maximum possible size of $S_1$ in a type A
partition, when a unit i is in R.

Lemma 4: The minimum possible size of $(S_2 \cup R)$ is k only if there
is a set C of at least $(k - t'_m)$ units such that for each pair of
units $m, n \in C$, $P_{mn} \le k$.

Proof: Deleted because it is obvious.

Every set C satisfying the condition in the Lemma is a possible
candidate for R, but $|S_2 \cup R|$ must be evaluated to make sure that
it is so. The minimum possible size of $(S_2 \cup R)$ establishes a
limit on $|S_1|_{max}$.

We start with a unit i in R and establish a lower limit on k
by first finding the smallest k such that there are at least
$(k - t'_m - 1)$ units j such that $P_{ij} \le k$. If the
lower limit is to be increased, every possible C has to be formed
and checked. If any C results in a type A partition, the value
of $t'_m$ is updated, the new $t'_m$ being $(\max(|S_1|, |S_2|) - 1)$ for
that partition, and the procedure is repeated till the value of $t'_m$
is unchanged during the iteration with that value of k. If there
exists no type A partition with $|S_2 \cup R| = k$, we can increment k
and start all over again. We can do this till we have brought
$|S_1|_{max}$ down to a number sufficiently smaller than $t'_m$.

The various C's can be evaluated by converting a Boolean
expression in a product of sums form to a sum of products
representation, e.g., if $P_{jn} > k$ and $P_{jm} > k$, then j and n, m
cannot be in the same C. We express it as
$(X_j \bar{X}_n \bar{X}_m + \bar{X}_j)$. The C's are evaluated from

    $$C_i = X_i \prod_j (X_j N_j + \bar{X}_j) \quad \text{where} \quad N_j = \prod_{q:\, P_{jq} > k} \bar{X}_q$$

Only those $X_j$ are considered which have $P_{ij} \le k$ and at least
$(k - t'_m - 1)$ such other terms not in $N_j$. No reductions are
performed on the sum of products. Also note that if a product term
contains more than $(k - t'_m)$ literals, all combinations of size
$\ge k - t'_m$ are possible C's.

Procedure 2: When a unit i is in R, $S_1$ can be formed only from
$M_i$, the set of units not tested by i. After establishing an upper
bound on $|S_1|$ using procedure 1, procedure 2 can be used to
look for type A partitions.

Lemma 5: A type A partition exists with $|S_1| = |S_1|_{max}$ only if
there exists a set of units D in $M_i$ such that $|D| = |S_1|_{max}$ and
for every pair of units m, n in D, $Q_{mn} \le |S_1|_{max} + t'_m$.

Proof: Deleted because it is obvious.

Every set D satisfying the condition in the Lemma is a
possible candidate for $S_1$ in a type A partition. However, each
D has to be examined individually to check if $|S_2| \le t'_m$.

After starting with a unit i in R and arriving at a $t'_m$ and
$|S_1|_{max}$ by using procedure 1, we can find a lower value for
$|S_1|_{max}$ by first finding the largest number w such that there are
at least w units in $M_i$ such that each of them satisfies the condition
on Q with at least (w-1) other units. If $|S_1|_{max}$ is to be lowered,
every possible D has to be formed and checked. If any D results
in a type A partition, the value of $t'_m$ is updated and the procedure
is repeated for the same $|S_1|_{max}$. It is also possible to try to
reduce $|S_1|_{max}$ by repeating procedure 1 using the new value of
$t'_m$. If there exists no type A partition with $|S_1| = |S_1|_{max}$,
we decrement $|S_1|_{max}$ and repeat. We do this till $|S_1|_{max}$ is
reduced to 1.

The D's can be found in a manner analogous to that for finding
the C's. The $X_j$'s considered are those units in $M_i$ which satisfy
the condition on Q with at least $|S_1|_{max} - 1$ other units in $M_i$,
the $\bar{X}$'s representing units not satisfying the condition on Q
with $X_j$. Also, if a product term in the sum of products
representation has more than $|S_1|_{max}$ literals, only combinations
of size $|S_1|_{max}$ are to be considered as candidates for D.

Procedure 2': After we have examined all partitions which contain
a unit i in R and are examining partitions which contain a unit j
in R, we can try to avoid examining some D's which have already
been examined before. We can divide $M_j$ into $M_j \cap M_i$ and
$M_j \cap M_i^c$. Every D must have at least one unit from
$M_j \cap M_i^c$ in order to be an unexamined one. Therefore,
$|S_1|_{max}$ will be determined by the maximum size of D containing a
unit from $M_j \cap M_i^c$.

After examining partitions containing units $i_1, i_2, \ldots, i_p$ in R,
the next unit we pick should be a unit j which has the smallest
$P_{j i_k}$ for all j and all $i_k$ in $i_1 \ldots i_p$. We will then
partition $M_j$ into $M_j \cap M_{i_k}$ and $M_j \cap M_{i_k}^c$ and use
procedure 2 with the exception that D's not containing any units from
$M_j \cap M_{i_k}^c$ are not examined at all.


4.3.7 An Example 

We shall now use an example to clarify the algorithm. In
Table 1 is given a connection assignment, with the P's and Q's
given in Table 2.

Procedure 1: We shall start with unit 3 in R. Initially
$t'_m = t - 1 = 5$. We need at least (k-6) units with $P_{3j} \le k$.
Therefore, the smallest value of k is 9. The possible candidates for R
with $|R \cup S_2| \le 9$ are units 3, 2, 6 and 10. We now form $C_3$.

    $$C_3 = X_3 (X_2 \bar{X}_{10} + \bar{X}_2)(X_6 \bar{X}_{10} + \bar{X}_6)(X_{10} \bar{X}_2 \bar{X}_6 + \bar{X}_{10})$$

At any stage, we multiply 2 sum of products terms only if they have
at least one common $X_j$. The C's can be easily formed from these
disjoint sum of products terms.

    $$C_3 = X_3 (X_2 X_6 \bar{X}_{10} + X_{10} \bar{X}_2 \bar{X}_6)$$

The maximum possible size of C is 3 and Lemma 4 is not satisfied.
Therefore, we check for k=10. Now, the C's can be selected from
units 3, 2, 5, 6, 10, 13, 15. We again form $C_3$ as
$X_3 \prod_j (X_j N_j + \bar{X}_j)$ over j = 2, 5, 6, 10, 13, 15, with,
for example, $N_{10} = \bar{X}_2 \bar{X}_5 \bar{X}_{15}$ and
$N_{15} = \bar{X}_{10} \bar{X}_{13}$.

$X_{10}$, $X_5$ and $X_{13}$ can be dropped because every C must have a
size of at least 5 and they satisfy the condition on P with only
3 other $X_j$'s. Since there are only 4 possible candidates
remaining, there exists no C satisfying Lemma 4 for k=10. Therefore,
the smallest $|S_2 \cup R|$ is greater than 10. Now we can switch to
procedure 2.

Procedure 2: Since the minimum size of k is > 10, $|S_1|_{max} = 4$.
The units not tested by 3 are 2, 4, 6, 8, 9, 10, 12, 14 and 15. It can
be seen from the table that none of these units has more than 1 unit
satisfying the condition on Q. Therefore, there is no type A
partition containing unit 3 in R. Now we look for type A partitions
not containing unit 3 in R.
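
The Procedure 1 steps of this example can be reproduced from the connection assignment of Table 1. The sketch below is illustrative: it enumerates candidate sets directly rather than through the Boolean product-of-sums manipulation, and the names `TABLE_1`, `gamma`, `candidates` and `largest_C` are hypothetical. The rows are the Table 1 matrix, where row m, column n is 1 iff unit m is tested by unit n.

```python
from itertools import combinations

TABLE_1 = """
001010010011100 000110100010110 010001011001001 100000000011111
001000110100110 100100010101100 101101001100100 010100100011001
100111010010000 100100011001010 001010100101001 010010101000011
001000010111010 100001111010000 100010101100100
""".split()
N = len(TABLE_1)
UNITS = range(1, N + 1)

# gamma[n] = set of units tested by unit n (column n of the table)
gamma = {n: {m for m in UNITS if TABLE_1[m - 1][n - 1] == "1"} for n in UNITS}

def P(i, j):
    return len(gamma[i] | gamma[j] | {i, j})

def candidates(i, k):
    """Units j (other than i) with P_ij <= k."""
    return [j for j in UNITS if j != i and P(i, j) <= k]

def largest_C(i, k):
    """Size of the largest set C containing i with all pairwise P <= k."""
    pool = candidates(i, k)
    for size in range(len(pool), 0, -1):
        for extra in combinations(pool, size):
            if all(P(a, b) <= k for a, b in combinations(extra, 2)):
                return size + 1  # count unit i itself
    return 1

print(candidates(3, 9), largest_C(3, 9))    # [2, 6, 10] 3
print(candidates(3, 10), largest_C(3, 10))  # [2, 5, 6, 10, 13, 15] 4
```

With $t'_m = 5$, Lemma 4 needs a C of at least 4 units for k = 9 and at least 5 units for k = 10. The largest sets found have 3 and 4 units respectively, in agreement with the conclusion above that the smallest $|S_2 \cup R|$ exceeds 10.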

4.3.8 Conclusion

We have established bounds on the I/T fault diagnosing
capability of t-fault diagnosable systems and given an algorithm
to determine this value for any connection assignment. Since this
does not take into account the statistics of the I/T fault, this
can be looked on as the minimum capability of the system.


Table 1. Connection assignment

001010010011100
000110100010110
010001011001001
100000000011111
001000110100110
100100010101100
101101001100100
010100100011001
100111010010000
100100011001010
001010100101001
010010101000011
001000010111010
100001111010000
100010101100100

(m,n) = 1 iff m is tested by n
Number of units = 15, t = 6

5. Tolerance Strategies

5.1 Introductory Statements:

Given that a machine is imperfect, information may or
may not be known about the types of phenomena which will
produce I/T faults in it. In either case, the problem is the
same: how can the circuitry be arranged to eliminate the I/T
faults or the effects of the I/T faults? Is enough information
available to do this? If not, what can be done to make
this goal possible? This report is intended to address these
problems. Material pertinent to this subject is listed in
references 5-1 through 5-10.

5.2 I/T Fault Intolerance:

The first method is to attempt to eliminate the I/T faults.
This approach is called fault intolerance. The most reliable
components are used in constructing equipment to lower the
probability of overall failure for a given mission time. No redundancy
is employed, and in general, all components must function
properly for the system to operate. The overall reliability is the
product of the component reliabilities, and since these have
been arranged to be as high as possible, the system will have a
high reliability.
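
The product rule shows why this approach runs out of room as the component count grows. The figures below are hypothetical and serve only to illustrate the arithmetic. Independence of component failures is assumed.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Reliability of a non-redundant system in which every component must work."""
    return prod(component_reliabilities)

# 100 hypothetical components, each with reliability 0.999 over the mission time
r = series_reliability([0.999] * 100)
print(round(r, 4))  # 0.9048
```

Even with very good individual components, the system figure falls noticeably below that of any single component.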

There is a limit to how much the reliability can be increased,
both physically and economically. The I/T fault intolerance
approach adds nothing new to the structure of a system, and
better reliabilities are required, so the next approach
is considered.

5.3 I/T Fault Tolerance:

The idea behind fault tolerance is to utilize redundant
circuitry to eliminate the effects at the output of internal
faults. The fact that faults will occur is accepted; only
through redundancy can the effects be eliminated.

5.3.1 Faults in Combinational Circuits:

In a combinational circuit, the worst that an I/T fault
can do is to cause the circuit to produce erroneous output
while the fault is present. If measures are taken to ensure
that these erroneous outputs are masked, the task is
accomplished. The job is simplified because there can be no
after-effects of the I/T faults.

Triple modular redundancy has been used at this level,
as shown in Figure 5-1. If no more than one module in a
triple experiences a fault at the same time, the voter output
will be correct. Depending upon the voter reliability, single
or triple voter schemes are used. With combinational circuits,
triple modular redundancy may be used at any level. Circuits
may consist of a single overall triplication, or of triplication
of modules which are triplications of modules, etc. Note that
while transient faults have been implied in this discussion,
single permanent faults per module will also be masked.
Naturally, another I/T fault in another portion of that same group
can cause system failure.

5.3.2 Sequential Circuits:

When dealing with sequential circuits, I/T faults may
affect them after the fault disappears. Wakerly points
this out in [5-2]. If triple modular redundancy is used on
a level low enough so that the replicated modules are
combinational, the overall circuit may be sequential in nature
but the redundancy scheme will filter out all single I/T
faults per module. If it is desired to triplicate modules
which are themselves sequential, then a build-up of faults
can occur. An I/T fault in a sequential machine can change
the machine's state. This can clearly lead to improper
execution. Therefore, applying triple modular redundancy
to sequential modules is more complicated than to
combinational modules. However, it has become increasingly
important to apply redundancy to sequential modules.
Breaking a circuit down into portions so small that they are
all combinational results in modules which are comparable in
complexity to the voters themselves (not very complex). It
is doubtful that fault tolerant machinery constructed in this
manner would be much more reliable than the original
non-redundant machine.

The level of complexity of commercially available
integrated circuits is constantly rising, and due to this
constraint many times it is impossible to break a circuit down
into combinational modules. Wakerly shows that the modules
must be restorable, and that restoring inputs must be applied
in normal operation. Some circuit structures directly lend
themselves to this automatically in normal operation, while
for others special resynchronizing inputs must be devised and
applied. In the latter case, the problem is that only single
faults per module can be guaranteed not to affect the output
between applications of the resynchronizing inputs.

If enough is statistically known about the I/T faults,
then the overall reliability can be computed for various
resynchronizing input frequencies.
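
The difference between masking in combinational modules and fault build-up in sequential modules can be illustrated with a small simulation. This is a sketch under stated assumptions and not Wakerly's design: the module is a hypothetical counter, the restoring input is modeled as reloading the voted state into all three copies, the voter is assumed perfect, and the reliability formula is the standard 2-out-of-3 expression for independent modules.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote of three module outputs."""
    return (a & b) | (a & c) | (b & c)

def tmr_reliability(r):
    """Probability that at least 2 of 3 independent modules work (perfect voter)."""
    return 3 * r**2 - 2 * r**3

def run(resync):
    """Two transients hit different copies of a triplicated counter."""
    states = [5, 5, 5]             # stored state of the three copies
    states[0] ^= 0b100             # first transient corrupts copy 0
    if resync:
        states = [majority(*states)] * 3  # restoring input reloads the voted state
    states = [s + 1 for s in states]      # one clock step; correct value is now 6
    states[1] ^= 0b100             # later transient corrupts copy 1
    return majority(*states)

print(run(resync=False))  # 2: two copies are now wrong and the vote fails
print(run(resync=True))   # 6: the first error was cleared before the second fault
print(round(tmr_reliability(0.9), 3))  # 0.972
```

The first transient is masked at the output in both runs. What differs is whether the corrupted state survives long enough to combine with the second fault, which is the reason the interval between resynchronizing inputs enters the reliability calculation.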

5.3.2.1 Larger Scale Modularization:

In [5-3] Wakerly describes a method of utilizing triple
modular redundancy with microprocessors with associated memories.
Considerable discussion is given to the voter placement problem.
The resultant system constantly runs a program of resynchronizing
routines which restore the registers on the central processor
chip as well as the external memory. To prevent a single
microprocessor which has gone awry from deteriorating the system,
the processors are periodically restarted.

Wakerly's scheme will certainly work, but it has limitations.
Due to the choice of placing the voters after the memory
instead of between the processor and the memory, if a failure
occurs within a processor, its faulty data can be placed
anywhere within its external memory. To keep the memory clean,
the entire memory must periodically be completely rewritten.
It is not sufficient to wait until an error is detected and
only rewrite the present address location. The problem is
that for any reasonably large memory it will take a considerable
amount of time to totally rewrite. Furthermore, the
cleansing must be done fairly frequently to capture as many
I/T faults as possible. This leaves very little time for the
processors to perform the original task assigned them. For
these reasons it is believed that the triple modular redundancy
scheme for microprocessors described in [5-3] needs improvement.

5.3.2.2 Special Problems with Microprocessor Implementations

To a limited degree, fault tolerance implementations for
low level circuits exist. Designing a random logic circuit to
be fault tolerant is possible primarily for two reasons:
analysis and test set generation. Due to the low level nature
of the circuit the effects of any particular fault can be
analyzed. It is only because I/T faults injected into a circuit
can be analyzed that it then becomes possible to generate test
sets. The problem of test set generation is complex even in
low level circuit descriptions; however, there are techniques
available which give best solutions, at least in theory if not
in practice.

There are also practices devised suitable for connecting
large computers together to provide some measure of fault
tolerance. Even in this case, it is not clear what a good measure
of fault tolerance is. It is also necessary to specify what
the faults are. I/T and permanent faults may be considered.

The problem under investigation falls between the above two
categories. How can microcomputers be best configured to
provide fault tolerance? Microprocessors have peculiarities
which must be considered in planning tolerance strategies.
The relative price of each component changes what should be
duplicated in the overall system. The capabilities of
microcomputers are not as great as those of large computers, and
strategies developed for large computers are often not at all
suitable for implementation with microprocessors.

5.3.2.2.1 Microprocessor Redundancy Schemes

Redundancy schemes applied to microprocessors fall into
two categories: those which are specifically designed for the
microprocessor, and those which are general in nature and are
originally intended as reliability schemes for larger processor
systems, but are adapted to microprocessor based processors.
Ideas based on larger systems which are later applied to
microprocessor systems sometimes make little sense. There are some
features of microprocessors which must be taken into account
when devising tolerance schemes. The first is complexity. If
the redundancy scheme uses so much extra hardware as to
overshadow the amount of hardware that the microprocessor itself has,
then the reliability of the hardware added for the extra
reliability will probably be such that when compared to the
reliability of the original nonredundant system, little will be
added, if indeed the so-called reliable system is not less
reliable than the simple system. Another feature which makes
the microprocessor very different from larger processors is
speed. The microprocessor is quite slow when compared with
large computers and minicomputers. Elaborate reconfiguration
schemes implemented in software may take a long
time to execute, and depending on the application, may or may
not be suitable.

The report from Ultrasystems [5-7] on reconfigurable
computer systems is quite complete. Many of the ideas
presented there can be adapted to microprocessor designs. They
categorize their approaches as mostly software, hardware aided
software, and mostly hardware. The mostly hardware proposals
involve a large amount of hardware, and would not be desirable
to implement on a processor system using microprocessors as
the processor elements, as the complexity of the extra hardware
is large compared to the relatively small amount of hardware which
the individual microprocessors require. Reliability is closely
connected with the number of interconnections, and hardware
designs involving large quantities of integrated circuits
demanding many interconnections tend to become unreliable.
Therefore, any hardware added to microprocessors for I/T fault
tolerance should be small to moderate when compared with the
complexity of the microprocessor itself.

The major thrust of microprocessor based controllers is
to replace hardware with software. Continuing in this manner,
it would seem that any fault tolerant microprocessor system
should use as little added hardware as is possible for the
added fault tolerance. The techniques mentioned in [5-7]
under mostly software can be adapted reasonably well to
microprocessor systems. These range from minimal additions
to vast reconfiguration software monitors. The large
monitors should be avoided with microprocessor implementations
since microprocessors do not usually have a large amount of
memory to hold elaborate programs, and very elaborate monitor
programs would tend to take a long time to execute on
microprocessor systems. Nevertheless, there are some very good
techniques discussed in [5-7], and those which can be used on
microprocessor systems will be briefly outlined here.

The applications program is broken into program segments.
The choice of program segments can greatly affect the
reliability of the end product. No more than one output statement
should be in any one program segment, and large calculations
should be broken down into several segments. A set of variables
called the state vector is associated with each program segment.
The state vector is such that in order to leave a particular
segment with the correct data, all that should be needed is
the state vector input to that segment. Naturally, the larger
the program segment, the larger will be the state vector. If
the program is operating properly, and if the state vector is
correct, then the output of that program segment should be
correct. That data can then be used as the state vector for
the next program segment.

Comparison of state vectors is the major reliability addition
made in multiple processor implementations in this approach.
Multiple processors all execute the same program, and produce
state vectors. Before a program segment is initiated, the state
vectors of each processor are compared. If all agree, the
processing continues in the normal manner. If a disagreement is
found, one of several things will occur. If there are more than
two processors, then program rollahead can be used. A vote is
taken on the state vectors, and any processor which disagrees
with the outcome of the vote has its state vector forcibly
changed to what the others have. Program execution continues as
normal. If there are only two processors or if there is a tie
vote with an even number of processors, rollback must be used.
It is known that a mistake has occurred, but it is not known
where. The previous state vector is reloaded, and program
execution of the prior program segment is repeated. Rollback, of
course, takes longer than rollahead, and the larger the program
segment, the longer the recovery time when rollback is used.
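
The comparison step described above can be sketched as follows. This is an illustration and not the scheme of [5-7] in detail: the function name, the representation of a state vector as a tuple, and the choice to roll back whenever no strict majority exists (including three-way disagreement) are assumptions made here.

```python
from collections import Counter

def compare_state_vectors(vectors, previous):
    """Decide what to do before starting the next program segment.

    vectors  -- one state vector (a tuple) per processor
    previous -- the last state vector on which all processors agreed
    Returns (action, state vectors to load into the processors).
    """
    value, votes = Counter(vectors).most_common(1)[0]
    if votes == len(vectors):
        return "continue", list(vectors)
    if votes > len(vectors) / 2:
        # Rollahead: overwrite the dissenting processors with the voted vector.
        return "rollahead", [value] * len(vectors)
    # Two processors, or a tie: the error cannot be located, so repeat the segment.
    return "rollback", [previous] * len(vectors)

good, bad, old = (1, 2), (1, 9), (0, 0)
print(compare_state_vectors([good, good, good], old)[0])  # continue
print(compare_state_vectors([good, bad, good], old))      # ('rollahead', [(1, 2), (1, 2), (1, 2)])
print(compare_state_vectors([good, bad], old))            # ('rollback', [(0, 0), (0, 0)])
```

Rollahead costs only the vote and the overwrite, whereas rollback repeats a whole segment, which is why segment size governs recovery time.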

Many other considerations come into play with multiple
processor systems, such as keeping track of the frequency of
errors in each module, knowing when to remove a processor from the
system, and trying to restart faulty processors at a slow rate.
Implementations can be made on microprocessors with these
techniques.

Reliability schemes have been specifically developed for
microprocessor systems. Wakerly [5-3] describes a triple
modular redundancy system for microprocessors. He replicates
processor/memory pairs and adds voters. He discusses where
the optimum place is to put the voters, and decides that
placing the voters on the output of the memory is best. This system
is simple, and the hardware automatically assures that the
processors receive the proper data from memory. The biggest
problem with this implementation is that it is possible for
memory locations to be changed to bad data, so that periodically
it is necessary to read and rewrite the entire memory contents.
This cleans up any errors; however, for larger memories it
could take a considerably long period of time to execute.
Even this minimal arrangement requires a lot of circuitry: 24
voters for an 8 bit machine. Reliability curves can be
provided for the various systems.

5.3.2.2.2 Possible Use of Bit Slice Microprocessors for
Tolerance

Many microprocessor fault tolerance approaches utilize
a modular structure. Processors are replicated and
comparisons are made between them. There is a lot of overhead in
these designs for the limited amount of reliability gained,
and an alternative approach is desired. An interesting
possibility is the use of the bit slice microprocessor designs
for this purpose. What would be significantly useful would
be a tolerance structure which built a sixteen bit
microprocessor out of five four bit slice microprocessors, leaving
one extra for redundancy. This would be an overhead of only
25% versus 200% for a triple modular redundant system.
However, the only part of the bit slice microprocessor design
which is actually modular in the slice sense is the register,
arithmetic, and logic unit (RALU). The contemporary bit slice
processor chips are powerful and include registers and shifters.
These are useful for multiplication and similar powerful
instructions. These are sequential in nature, and this in
itself is a problem. To make a transparent redundant system
from these devices with voters on the outputs of the RALUs
would require that the sequential portions of them not be
used. This destroys most of their power, and is unreasonable.
Even if this were not a problem, the RALU is a minor portion
of the overall circuitry of which the bit slice microprocessor
is composed, and it is not reasonable to make the RALU
tolerant while not doing anything to the rest of the circuitry to
improve the fault tolerance. If the bit slice microprocessors
included most of the slice properties throughout most of the
circuitry, then perhaps good advantage could be made of them
for fault tolerance implementations. No way is seen to do
this with present bit slice microprocessors which is any better
than with non bit slice microprocessors, and no way is evident to
design a new type of bit slice microprocessor which would
allow one to take advantage of the slice properties for fault
tolerance implementations.

5.3.2.2.3 The Problem of Determining the Effectiveness
of Designs

There are many schemes proposed to achieve fault tolerant
microcomputer systems, most incorporating multiple processors
for the redundancy needed for the fault tolerance. At present,
short of construction and operation in hostile environments,
there is no good way to determine the relative effectiveness of
the various approaches, or even to verify that a particular
implementation will perform as claimed. The reason that there is
so much difficulty in determining these parameters is that the
fault class being considered is phenomenally large: namely, all
intermittent/transient faults. Parameters to be determined are
such things as the sensitivity of the strategy to burst type
faults, dependent faults, and catastrophic faults, how long the
recovery times are for different faults, and the longest
expected and the mean recovery times. Very little is understood
about the various categories of faults which can be utilized in
analyzing fault tolerant approaches to microprocessor designs.
What is needed then is a general model for I/T faults. It is not
clear, however, that a general model for such faults exists.
Realizing this problem, the research emphasis has been shifted
from fault tolerant strategies for microprocessors to that of
measuring and modeling the intermittent/transient faults which
can influence microprocessor based systems.

5.3.2.2.4 Plan for Resolving the Problem

The work thus far has been directed towards obtaining the means
to further examine proposed multiple processor schemes. This is
being accomplished somewhat experimentally. A microprocessor
system has been constructed for this use.

The first step was to subject the processor to various induced
faults. These induced faults are meant to copy the real world
intermittent/transient faults to which such a microprocessor
system might reasonably be expected to be exposed. Data to be
collected from the Lear jet experiments to be conducted in
Florida will be a more realistic guide in choosing faults to
induce. In fact, there is little distinction between actual
fault situations and induced faults. The faults to which a
circuit is exposed in normal operation are the faults of
interest, but by the very fact that a study of those faults is
being made, the circuit under test is not in normal operation.
This is particularly true since one cannot wait for the natural
faults to manifest themselves, but must force the circuit into a
faulty situation.

Any faults to which a network is purposefully exposed are called
induced faults. The induced faults are as close an approximation
as is possible to the real faults to which the circuit would
normally be exposed. Naturally, the fault rate will be higher
with the induced faults than in a mildly hostile environment,
but this is necessary in order to accomplish our goals.

The first step in the experimental procedure was to subject the
test microprocessor to different induced fault situations. Such
things as high or low power supplies, noisy power supplies,
heat, and electromagnetic interference are possible induced
faults which we have attempted to consider. Our choices will
soon be coupled with the Lear jet experiments, which will
ultimately be the guide in choosing and implementing the induced
faults. The experiments we are performing can be described as
follows. First of all, it is not known how these induced faults
will affect the processor: these are decentralized faults, and
the individual lines actually driven to faulty values are not
known. A diagnostic program which gives data on the faults as
they occur runs on the processor while faults are being induced.
The purpose is to collect enough data on each induced fault so
that the data will be a signature of that fault, and give an
indication of its severity. When enough data has been collected
on each induced fault, the fault emulation stage begins. This
work is still in its initial stages; briefly, fault emulation
can be characterized as follows.

Induced faults will be approximated by emulated faults.
Emulated faults are faults for which it is known how and, more
importantly, where they affect the circuit. For example, suppose
a noisy power supply causes intermittent fault situations to
occur in the processor. The cause of the fault, the power
supply, is known. However, it is not known where that fault is
affecting the circuit to cause the failures. It could be
internal to any of the integrated circuits, and may not even be
directly observable on the pins of the packages. A logic level
somewhere in the processor must be changed to cause the failure,
but it is not known on which line the level has been changed,
nor in which direction. Emulated faults will be postulated for
each induced fault, and the same tests will be run with each
emulated fault as were run with the induced faults. Comparison
of the emulated fault data and the induced fault data will serve
as a feedback loop to improve the accuracy of the postulated
emulated faults. In this manner, a set of emulated faults will
be constructed which is in a sense a model of the induced
faults, which in turn are a close representation of the actual
intermittent/transient faults encountered in a real situation.
The emulated faults for a specific induced fault may be used as
a model of that induced fault because the direct effects of the
emulated faults in the processor are known. This allows the
entire system to be simulated on a large computer, and
evaluations can be made of the effectiveness of the fault
tolerance strategy. The emulated faults do not have to be known
deterministically, but may only be known statistically. The
proposed way to generate these emulated faults is to have some
intelligent device (minicomputer or specialized hardware) drive
various fault injection networks which are embedded throughout
the microprocessor. This will give a large degree of freedom in
arriving at emulated faults in the hope that a match can be
obtained.
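
The comparison step of this feedback loop can be illustrated
with a minimal sketch. It assumes, purely for illustration, that
a fault signature is a table of counts of diagnostic events. The
event names, the candidate names and the distance measure below
are hypothetical and are not taken from the experiments
described in this report.

```python
def signature_distance(sig_a, sig_b):
    """Sum of absolute differences between two event-count signatures.

    A signature is assumed to be a dict mapping a diagnostic event
    name to the number of times it was observed (an assumption).
    """
    events = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(e, 0) - sig_b.get(e, 0)) for e in events)


def best_emulation(induced_sig, emulated_sigs):
    """Name of the postulated emulated fault closest to the induced one."""
    return min(
        emulated_sigs,
        key=lambda name: signature_distance(induced_sig, emulated_sigs[name]),
    )


# Hypothetical data: one induced fault and two postulated emulations.
induced = {"rst7_trap": 5, "memory_error": 2}
postulated = {
    "stuck_address_line": {"rst7_trap": 4, "memory_error": 2},
    "flipped_data_bit": {"instruction_error": 6},
}
closest = best_emulation(induced, postulated)
```

In a loop of this kind, the closest candidate would be refined
and tested again until its signature is judged close enough to
that of the induced fault.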

The plan is to construct a microprocessor, expose it to various
hostile environments, measure the effects, postulate an
equivalent emulated fault, expose the microprocessor to that
proposed emulated fault, measure the effects, and arrive at a
reasonably approximate class of emulated faults which can be
used to model a large class of real hostile fault environments.
The concept of an emulated fault includes any fault which can be
injected into the microprocessor in such a manner that its
direct effects are known, either deterministically or
statistically. These effects must be first order effects,
meaning that it is clear that the particular fault emulation is
directly causing the effect, and that the effect is not
indirectly caused by that injection.

On the other side are induced faults. These are an attempt to
expose the processor to a real hostile environment without
waiting for the processor to experience intermittent/transient
faults on its own. In order to test the validity of the
postulated emulated faults, faults must be induced into the
processor which can be expected to closely parallel a real
hostile environment. How these induced faults cause failures in
the processor need not be known; in general it is not known,
and this is the entire point of the current research. In short,
the induced faults are being substituted for real world faults,
and the emulated faults are effectively modeling the induced
faults, though perhaps not in a conventional manner. The whole
purpose of this procedure is to produce a methodology to test
the effectiveness of various fault tolerance microprocessor
strategies.

A suitable microprocessor for these proposed experiments has
been constructed. Figure 5-2 shows the configuration. The
processor has been constructed on plug boards, so that it may be
easily modified. This allows any of the lines to be broken for
the insertion of the fault injection networks. The processor is
complete, and the testing program is to be developed and
checked. Figure 5-3 shows the diagnostic program. When running,
it periodically prints a message to indicate that it is still
working. Implementation on the 8080 microprocessor has the
advantage that if the stack gets changed, and a non-memory
location is used for the stack, the processor will jump to a
nonexistent location, and receive the data hexadecimal FF, which
corresponds to the interrupt instruction on the 8080 (the
restart instruction RST 7). Advantage is taken of this in the
diagnostic program, and in normal operation the interrupt
instruction is never reached. A count is kept of the times an
interrupt is received as an indication of the processor running
awry. The checking program checks all of the microprocessor
instructions, and prints a message if execution is improper,
perhaps also with a time tag. It includes memory checks.
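
As a hedged illustration of why the byte FF traps a runaway
processor: on the 8080, restart instructions have the bit pattern
11NNN111 and call address 8 x NNN, so FF decodes as RST 7 and
transfers control to address 0038 hexadecimal. The sketch below
only decodes that pattern and counts such bytes in a list of
fetched opcodes. The function names are hypothetical, and this is
not the diagnostic program of Figure 5-3.

```python
def rst_vector(opcode):
    """Return the call address of an 8080 RST n opcode, or None otherwise."""
    if (opcode & 0b11000111) == 0b11000111:  # RST n is encoded as 11NNN111
        return 8 * ((opcode >> 3) & 0b111)   # RST n calls address 8 * n
    return None


def count_traps(fetched_bytes):
    """Count fetched bytes that decode as RST 7 (hexadecimal FF)."""
    return sum(1 for b in fetched_bytes if rst_vector(b) == 0x38)
```

A handler placed at address 0038 hexadecimal could increment such
a count each time it is entered, which is the kind of indication
of the processor running awry that the text describes.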

The exact nature of the induced faults must be chosen, and
suitable circuitry designed and built to create the faults. All
of the data must then be collected, and the fault injection
networks must be built for the emulations. Strategies for the
emulations must be developed, and data must again be collected
until the end goal is reached. Figure 5-4 shows the proposed
method for emulating faults.

This plan will yield the needed information on how
intermittent/transient faults affect microprocessors, and permit
further investigation of microprocessor tolerance structures.

Figure 5-4: Fault Emulation Experiment

Appendix

AN EFFICIENT FAULT DIAGNOSIS ALGORITHM FOR SYMMETRIC
MULTIPLE PROCESSOR ARCHITECTURES

INTRODUCTION


Consider a general model of a multiple processor architecture
consisting of $n$ digital modules denoted $U_0, U_1, \ldots,
U_{n-1}$ and some associated interconnection design, denoted
$D_{\delta,t}$. These modules, for example, could be $n$
processors implementing a segmented algorithm [6]. Regardless of
the use of the multiple processor architecture, we will assume
that each $U_i$ is capable of testing the other $U_j$ to which it
is directly connected for some specified class of faults. If a
module contains any such fault we will refer to it as faulty.
The problem we will study in this paper is the diagnosis of an
existing fault situation among the modules given their
respective testing results. This problem is not new and has been
examined elsewhere in the literature [5, 7, 8]. The results to be
presented here represent a new approach to such diagnosis. In
particular, the diagnosis procedure described will be seen to be
sufficiently straightforward to be easily implementable on a
simple processor, e.g., a microprocessor, and, for a proper
interconnection design among the processors and upper bound on
the number of simultaneous faults which can occur, will always
yield the correct diagnosis of the existing fault situation.



PRELIMINARIES

Given $n$ modules $U_0, U_1, \ldots, U_{n-1}$, $U_{f(r,i)}$ will
denote the modules which $U_i$ tests, $r = 1, 2, \ldots, t$,
where $f(r,i) \in [0, 1, \ldots, n-1]$, $i = 0, 1, \ldots, n-1$.
For convenience, we will always assume that $U_i$ tests itself
and, regardless of its state, concludes that it is fault free.
The outcome of the test of module $U_{f(r,i)}$ by module $U_i$
will be denoted $a(i, f(r,i))$ where

$$a(i, f(r,i)) = \begin{cases} 0, & \text{if } U_i \text{ concludes that } U_{f(r,i)} \text{ is fault free} \\ 1, & \text{otherwise.} \end{cases}$$

It should be noted that the conclusion of, say, $U_i$ regarding
the state (faulty or fault free) of the modules to which it is
connected is only reliable if indeed $U_i$ is fault free. If with
each module $U_i$ we associate a test table $B_i$, $i = 0, 1,
\ldots, n-1$, where $B_i$ represents the conclusion of $U_i$
regarding the states of all the modules, we have the problem of
determining the existing fault situation based on the available
test results. Whether or not this is feasible clearly depends on
the number of faults and the interconnection design. We will
assume in the following that at most $t$ modules can be
simultaneously faulty and that every module is tested by at
least $t$ other modules. Under some assumptions on the
interconnection design, Preparata, Metze and Chien [7] have
shown that it is feasible to diagnose any valid fault situation.
However, the diagnosis algorithms which have been proposed to do
so are quite complex [1, ...]. We propose here a new diagnosis
algorithm for this problem. For the purpose of explanation we
will assume in the following that the interconnection design
between the modules is the so-called $D_{1,t}$ design of [7],
wherein there is a testing interconnection from $U_i$ to $U_j$
if and only if $j - i = m$ (modulo $n$) and $m$ assumes the
values from 1 to $t$. The results presented here have been
extended to more general interconnection designs, but since they
are descriptively cumbersome, these extensions will not be
detailed.


DIAGNOSIS ALGORITHM

Each test table $B_i$ has components $B_{i,0}, B_{i,1}, \ldots,
B_{i,n-1}$ where $B_{i,j}$ represents the conclusion of module
$U_i$ regarding the state of module $U_j$. If module $U_i$
"believes" that module $U_j$ is fault free, then $B_{i,j}$ is set
to the value 0, otherwise $B_{i,j}$ is set to the value 1.
Suppose that $B_0, B_1, \ldots, B_{n-1}$ are complete in the
sense that every module has a conclusion regarding the state of
each of the modules $U_i$, $i = 0, 1, \ldots, n-1$. We will
assume here that if a module is fault-free, its corresponding
table is correct.

Lemma 1: There exist at least $n-t$ tables which are identical.

Proof: Since at most $t$ modules are faulty, and since a table
corresponding to a fault-free module correctly describes the
fault situation, the lemma follows.

Lemma 2: If there exists only one set of identical tables
$B_{i(1)}, B_{i(2)}, \ldots, B_{i(s)}$, $s \ge n-t$, then each of
the tables in this set correctly describes the existing fault
situation.

Proof: We already know that there exist at least $n-t$ correct
and therefore identical tables. Therefore, if only one set of
identical tables has a cardinality larger than or equal to
$n-t$, this set must consist of the correct tables.

It should be clear that no conclusion can be made regarding the
fault situation if there exists more than one set of identical
tables with cardinality larger than or equal to $n-t$.

Theorem 1: Suppose that $n \ge 2t + 1$; then there exists one and
only one set of identical tables with cardinality larger than or
equal to $n-t$.

Proof: Suppose that $n \ge 2t + 1$ and assume that there exist
two sets of identical tables of cardinality $n_1$ and $n_2$
respectively. Assume

    $n_1 \ge n - t$
and
    $n_2 \ge n - t$,
so that
    $n_1 + n_2 \ge 2n - 2t$.

We know that

    $n_1 + n_2 \le n$,

therefore

    $n \ge 2n - 2t$.

This inequality, used in conjunction with $n \ge 2t + 1$, yields

    $2n \ge 2n + 1$

and we conclude that we cannot have two sets of identical tables
of cardinality larger than or equal to $n-t$ when $n \ge 2t + 1$.
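
The voting step implied by Lemma 2 and Theorem 1 can be sketched
as follows. The sketch assumes that the $n$ tables are already
available as lists of 0/1 entries; the function name and the
choice of returning `None` when no unique set exists are
assumptions made for illustration.

```python
from collections import Counter


def agreed_table(tables, t):
    """Return the table shared by the unique set of at least n - t
    identical tables, or None if no such unique set exists."""
    n = len(tables)
    counts = Counter(tuple(table) for table in tables)
    candidates = [tbl for tbl, c in counts.items() if c >= n - t]
    if len(candidates) != 1:
        return None  # several sets qualify (or none): no conclusion
    return list(candidates[0])
```

When $n \ge 2t + 1$ and the tables of fault-free modules are
correct, Theorem 1 says that exactly one candidate exists, so the
`None` branch should only be reached when those assumptions are
violated.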

At this point we need an efficient procedure to build the
complete $n$ tables $B_0, B_1, \ldots, B_{n-1}$, such that if
module $U_i$ is fault free, then the table $B_i$ reflects
accurately the fault situation of the multiple processor
architecture. Such an algorithm is presented in the following to
compute the tables $B_0, B_1, \ldots, B_{n-1}$.

Algorithm 1: Let $i$ in $[0, 1, \ldots, n-1]$ and $t$ in
$[1, 2, \ldots, n-1]$ be given.

Step 0: Set $B_{i,m} = 0$ for $m = 0, 1, \ldots, n-1$, set
        $j = i$, set $k = i+1$ and set $N_F = 0$.

Step 1: If $N_F \ge t$, stop; else, go to Step 2.

Step 2: If $k = i$, stop; else, go to Step 3.

Step 3: If $a(j,k) = 1$, set $B_{i,k} = 1$, set $N_F = N_F + 1$
        and go to Step 4; else, set $j = k$ and go to Step 4.

Step 4: Set $k = k+1$ and go to Step 1.

Notes:

(i) All additions are performed modulo $n$;

(ii) We assume a $D_{1,t}$ interconnection design, $f(r,j) = j + r$
(modulo $n$); and therefore, we use the notation $a(j, j+r)$
instead of the more general notation $a(j, f(r,j))$.
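
A minimal sketch of Algorithm 1 follows. It reads the stopping
test of Step 1 as $N_F \ge t$, and it takes the test outcomes as
a function `a(j, k)`; both the function names and the example
outcome function (in which a faulty tester always reports 0) are
assumptions made for illustration.

```python
def build_table(i, n, t, a):
    """Algorithm 1: build the table B_i from the point of view of U_i.

    a(j, k) must return the outcome (0 or 1) of the test of U_k by U_j.
    """
    B = [0] * n            # Step 0: every entry starts as "fault free"
    j = i                  # current tester, trusted if U_i is fault free
    k = (i + 1) % n        # next module to classify
    n_f = 0                # N_F: number of faults recorded so far
    while n_f < t and k != i:          # Steps 1 and 2: stopping tests
        # The test of U_k by U_j must exist in a D_{1,t} design.
        assert 1 <= (k - j) % n <= t
        if a(j, k) == 1:               # Step 3: U_j reports U_k faulty
            B[k] = 1
            n_f += 1
        else:                          # U_k passes and becomes the tester
            j = k
        k = (k + 1) % n                # Step 4 (addition modulo n)
    return B


def example_outcome(faulty):
    """Hypothetical outcomes: fault-free testers report the truth,
    faulty testers always report 0."""
    def a(j, k):
        if j in faulty:
            return 0
        return 1 if k in faulty else 0
    return a
```

For instance, with $n = 9$, $t = 3$ and faulty modules 1, 3 and
6, `build_table(0, 9, 3, example_outcome({1, 3, 6}))` marks
exactly the entries 1, 3 and 6, since module 0 is fault free.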

Theorem 2: If a $D_{1,t}$ interconnection design is used, if the
maximum number of faults which may occur is $t$ and if module
$U_i$ is fault-free, then the table $B_i$ constructed by the
algorithm accurately reflects the existing fault situation.

Proof: We need to show that the algorithm is well defined and
that it produces tables which are correct whenever $U_i$ is not
faulty. The technique we use to prove the theorem is based on
the use of invariant assertions as described in [2] (see Fig. 1).

We assume that a $D_{1,t}$ interconnection design is used, i.e.,
module $U_i$ tests the modules $U_{i+1}, U_{i+2}, \ldots,
U_{i+t}$. The algorithm uses the quantity $a(j,k)$ which contains
the result of the test of module $k$ by module $j$. It follows
that the algorithm is well defined if and only if $j$ and $k$
are related by

    $k = j + r$

where $r$ is some integer in $[1, 2, \ldots, t]$.

Assume that before executing Step 3, the following assertion
holds:

(A1)  $j + 1 \le k \le j + 1 + N_F$.

Then it can be shown that (A1) still holds after the execution
of Step 4. Clearly (A1) is satisfied by the initial values given
to $j$, $k$ and $N_F$ and therefore we conclude that (A1) is
always satisfied before the execution of Step 3.

It is only possible to reach Step 3 if

    $N_F < t$.

It follows that just before the execution of Step 3, the
quantities $j$ and $k$ are related by the assertion

(A2)  $j + 1 \le k \le j + t$

which shows that the algorithm is well defined.

The first part of the proof showed that the algorithm is well
defined. We now prove that if $U_i$ is fault free, then the table
$B_i$ reflects the actual fault situation. Following again the
approach described in [2], we show that the following assertions
are always satisfied before the execution of Step 3:

(A3) The module $U_j$ is not faulty.

(A4) $B_i$ accurately reflects the existing fault situation up to
     $k-1$, i.e., for all $m$ in $[i, i+1, i+2, \ldots, k-1]$:

         $B_{i,m} = 0$ if and only if module $U_m$ is not faulty
     and
         $B_{i,m} = 1$ if and only if module $U_m$ is faulty.

(A5) $N_F$ contains the number of faulty modules up to $k-1$:

         $N_F = \sum_{m=i}^{k-1} B_{i,m}$.

It can be shown that if (A3), (A4) and (A5) are true before the
execution of Step 3, then they are still true after the
execution of Step 4. Clearly (A3), (A4) and (A5) hold after the
execution of Step 0 and therefore we conclude that (A3), (A4)
and (A5) are always true before the execution of Step 3.


Now, suppose that the algorithm stops in Step 1. We know that
$B_i$ is correct up to $k-1$ and that $N_F = t$. In other words,
$B_{i,m}$ correctly reflects the fault situation for $m = i,
i+1, \ldots, k-1$ and $t$ faults have been detected. But we have
assumed that at most $t$ faults may occur and therefore this
implies that the remaining modules are not faulty. The $B_{i,m}$
for $m = k, k+1, \ldots, i-1$ are equal to 0 and therefore the
complete table is correct.

Suppose that the algorithm instead stops in Step 2; then $B_i$
is correct up to $k - 1 = i - 1$ and therefore is correct.

We have shown that when the algorithm stops, it produces the
correct table. It remains to be shown that it indeed stops after
a finite number of iterations. We note that $k$ takes the values
$i+1, i+2, \ldots$ (modulo $n$) and therefore if the algorithm
does not stop in Step 1, it must necessarily stop in Step 2.
This concludes the proof of the theorem.

ACCELERATED ALGORITHM

The diagnosis of the set of faulty modules based on the results
of Lemma 1, Lemma 2, and Theorem 1 requires that the tables
$B_i$, $i = 0, 1, \ldots, n-1$ be compared. This process is time
consuming and may be avoided. For each $j = 0, 1, \ldots, n-1$,
let $Y_j$ be the number of indices $i$ for which $B_{i,j} = 1$,

    $Y_j$ = cardinality of $\{\, i \in [0, 1, \ldots, n-1] \mid B_{i,j} = 1 \,\}$,

then these quantities may be used in a diagnostic algorithm as
follows:

Algorithm 2: Let $t$ in $[1, n-1]$ be given.

Step 0: Compute the tables $B_0, B_1, \ldots, B_{n-1}$ by using
        Algorithm 1.

Step 1: Compute the quantities $Y_0, Y_1, \ldots, Y_{n-1}$.

Step 2: Let $V = \{\, j \in [0, 1, \ldots, n-1] \mid Y_j \ge t+1 \,\}$.

Theorem 3: If a $D_{1,t}$ interconnection design is used, if the
maximum number of faults which may occur is $t$ and if
$n \ge 2t + 1$, then $U_j$ is faulty if and only if $j$ is in $V$.

Proof: The result is a direct consequence of Lemmas 1 and 2
and Theorems 1 and 2.
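
Steps 1 and 2 of Algorithm 2 amount to a column count followed
by a threshold. The sketch below assumes that the tables have
already been computed and are passed in as lists of 0/1 entries,
and that the threshold in Step 2 reads $Y_j \ge t + 1$; the
function name is hypothetical.

```python
def faulty_set(tables, t):
    """Steps 1 and 2 of Algorithm 2: count votes and apply the threshold.

    tables[i][j] is B_{i,j}, the opinion of U_i about U_j (1 = faulty).
    """
    n = len(tables)
    # Step 1: Y_j is the number of tables that declare U_j faulty.
    Y = [sum(tables[i][j] for i in range(n)) for j in range(n)]
    # Step 2: U_j is diagnosed faulty when at least t + 1 tables say so.
    return {j for j in range(n) if Y[j] >= t + 1}
```

The threshold works because a fault-free module can be accused
only by the at most $t$ faulty tables, while a faulty module is
accused by every fault-free table, of which there are at least
$n - t \ge t + 1$ when $n \ge 2t + 1$.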

Algorithm 2 is well suited for implementation on a
microprocessor. For example, on an Intel 8080 microprocessor,
the total amount of memory necessary to store the data and the
program in the case $n = 8$ and $t = 2$ is 176 words of 8 bits,
i.e., 1408 bits.

We note that Algorithm 2 may be implemented in parallel on a
network of $N$ microprocessors, with $N \le n$. In particular,
if $N$ microprocessors are used, then it is possible to compute
in parallel all the tables $B_i$ and all the quantities $Y_j$.
The computational time necessary to diagnose the network of $n$
modules, using $N$ microprocessors for implementing Algorithm 2,
is essentially $T \lceil n/N \rceil / n$, where $T$ is the
computational time necessary to execute the instructions of
Algorithm 2 when a single microprocessor is used and
$\lceil n/N \rceil$ is the smallest integer larger than or equal
to $n/N$.
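
The time estimate $T \lceil n/N \rceil / n$ can be evaluated
directly. The sketch below is only arithmetic on that expression;
the function name and the sample value of $T$ are assumptions.

```python
def parallel_time(T, n, N):
    """Estimated diagnosis time T * ceil(n / N) / n on N microprocessors."""
    if not 1 <= N <= n:
        raise ValueError("N must satisfy 1 <= N <= n")
    rounds = -(-n // N)  # integer ceiling of n / N
    return T * rounds / n
```

With $n = 9$ and a hypothetical single-processor time $T = 90$
units, the estimate is 90 units for $N = 1$, 30 units for
$N = 4$ (three rounds of table construction) and 10 units for
$N = 9$.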




EXAMPLE

In order to demonstrate the simplicity of the algorithms, we
apply them to the network given in Figure 2. The network
contains $n = 9$ modules, $t = 3$, i.e., at most three modules
may be faulty, and a $D_{1,3}$ interconnection design is used,
i.e., module $U_0$ tests $U_1$, $U_2$ and $U_3$, module $U_1$
tests $U_2$, $U_3$ and $U_4$, etc. Assume that the modules $U_1$,
$U_3$ and $U_6$ are faulty. Figure 3 contains a possible set of
test outcomes. The application of the algorithm to these test
outcomes yields the tables $B_i$, $i = 0, 1, 2, \ldots, 8$ given
in Figure 4. We find that the tables $B_0$, $B_2$, $B_4$, $B_5$,
$B_7$ and $B_8$ are identical. We have 6 identical tables and
using Lemma 2, we conclude that these tables reflect the correct
fault situation of the network, i.e., we conclude that the
modules $U_1$, $U_3$ and $U_6$ are faulty. Alternatively, we may
compute the quantities $Y_0, \ldots, Y_8 = 1$, and then compute
the set $V = \{\, j \mid Y_j \ge 4 \,\} = \{1, 3, 6\}$. Using
Theorem 3, we conclude once again that $U_1$, $U_3$ and $U_6$
are faulty.

CONCLUSION

An approach to the problem of fault diagnosis of symmetric
multiple processor architectures has been proposed. It consists
of constructing tables $B_i$ assuming that the corresponding
modules $U_i$ are not faulty, followed by a voting procedure.
The construction of the tables is decoupled in the sense that
each table may be constructed independently of the others. It is
possible to decrease the amount of computation necessary to
obtain all the tables $B_i$, $i = 0, \ldots, n-1$ by increasing
the dependency between the construction of the various tables.
It is not difficult to find schemes in which the construction of
the table $B_i$ depends on the tables $B_0, B_1, \ldots,
B_{i-1}$. Such schemes are more complicated to code than the one
we propose, require more memory to store the program and do not
lend themselves to parallel implementation. Therefore, we feel
that our scheme, i.e., Algorithm 2, which has a time complexity
of $O(n^2)$ if sequentially implemented and $O(n)$ if implemented
on a network of $n$ microprocessors, is ideally suited for the
fault diagnosis of $D_{1,t}$ networks.
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